Data Science Project - Malicious Websites Detection based on Url Characteristics¶


Name: Guy Goren¶

ID: 208647859¶


¶

In [2]:
plt.figure(figsize=(12,9))
img = mpimg.imread('malicious.png')
imgplot = plt.imshow(img)
plt.show()
No description has been provided for this image

¶

Introduction¶

The internet plays a crucial role in our daily lives, offering a wealth of resources and opportunities, but it also brings challenges, such as the risk of encountering malicious websites.

A URL (Uniform Resource Locator) is essentially the address used to access resources on the internet. It provides information about the location and the method of retrieving web content.

Components of a URL¶

URLs typically consist of several key components:

  • Port (HTTP or HTTPS): Specifies how the browser should communicate with the server.
  • Domain name: Represents the web server or website that hosts the resource.
  • Path: Identifies the specific resource or webpage on the server.

Additionally, URLs may include:

  • Query parameters: Optional data passed to the server to modify or filter the response (search terms).
  • Fragment identifiers: Direct users to a specific section of a webpage.

Malicious URLs¶

Malicious URLs are web addresses created with harmful intent. They are designed to deceive users and carry out various malicious activities. These include:

  • Benign URLs: Legitimate web addresses that pose no threat to users. While they serve their intended purpose without malicious intent, many malicious URLs are designed to mimic benign ones to mislead users into thinking they are visiting a safe site.

  • Phishing: These URLs are crafted to resemble legitimate websites (such as banks or popular services) to trick users into entering sensitive information like usernames, passwords, or credit card details. Phishing URLs often lead to fraudulent login pages that capture user data.

  • Defacement: Some malicious URLs lead to websites that have been defaced — where content has been altered or vandalized. Hackers may display unauthorized messages, disrupt services, or damage an organization’s credibility.

  • Malware: Other malicious URLs are designed to deliver malware, including viruses, ransomware, or spyware. When users click on these URLs, they may unknowingly download harmful software that can compromise their system, steal data, or hold their files hostage.

Although malicious URLs may appear legitimate at first glance, they often conceal attacks that can lead to data breaches, financial loss, or system compromise. As cyber threats continue to evolve, detecting and mitigating these risks is a critical area of focus in cybersecurity.

In [5]:
plt.figure(figsize=(12,9))
img = mpimg.imread('URL1.png')
imgplot = plt.imshow(img)
plt.show()
No description has been provided for this image

Motivation¶

With the exponential growth of internet users and web services, malicious URLs have become a primary vector for cyberattacks. These attacks range from phishing schemes that trick users into revealing personal information to malware that silently infects devices. The widespread impact of malicious websites is extensive, including:

  • Identity theft
  • Data loss
  • Financial fraud
  • Damage to both individuals and organizations

The rapid growth of the internet has made it an indispensable part of daily life, but it has also become a breeding ground for cyber threats. Malicious websites are a key component of these threats, as cybercriminals continually create harmful URLs to exploit unsuspecting users.

Each year, more than 10 million new malicious websites are created globally, aimed at:

  • Delivering malware
  • Stealing personal data
  • Launching phishing attacks

According to a 2023 report from Symantec, over 1.5 million new phishing sites are created each month. The FBI’s 2022 Internet Crime Report notes that cybercrime cost Americans over $10 billion in a single year, with phishing and malware being major contributors to these losses.

Increasing Sophistication of Malicious Actors¶

Malicious actors are becoming more sophisticated, using advanced techniques to disguise harmful URLs. This makes traditional blacklist-based detection methods struggle to keep up with the speed and volume of emerging threats.

Manual detection is no longer practical due to the vast number of websites. Therefore, automation and scalability have become crucial in addressing this issue.

The Role of Machine Learning¶

Machine learning offers an effective solution for detecting malicious URLs. By analyzing URL patterns and structures, machine learning models can help improve both the efficiency and accuracy of malicious URL detection.

Given these alarming trends, developing a robust, automated system for detecting malicious websites based on URL characteristics is essential to:

  • Enhance cybersecurity measures
  • Reduce potential harm
  • Stay ahead of evolving cyber threats

threats.orbes

Goals¶

The primary goal of this project is to develop an effective system for detecting malicious URLs based on their characteristics. This involves:

  • Feature Extraction: Identifying and extracting relevant features from URLs that may indicate malicious intent, such as unusual patterns, length, and the presence of suspicious keywords.
  • Machine Learning Models: Implementing machine learning algorithms to classify URLs as benign or malicious based on the extracted features.
  • Real-Time Detection: Creating a system capable of analyzing URLs in real-time to provide immediate feedback and protection to users.
  • Improving Detection Rates: Aiming to enhance the accuracy and efficiency of existing detection methods to reduce false positives and false negatives.

Challenges and Struggles in Detecting Malicious URLs¶

Detecting malicious URLs is a complex and evolving challenge due to several factors. As attackers continuously refine their techniques, security systems must also adapt. Below are some of the key challenges faced in detecting malicious URLs:

1. Evolving Techniques of Attackers¶

Cybercriminals are constantly developing new ways to disguise their malicious URLs. This includes:

  • Obfuscation Techniques: Attackers often obfuscate URLs using URL shortening services, encoding characters, or implementing multiple redirects to hide the actual destination.
  • Domain Generation Algorithms (DGA): These algorithms are used to create large numbers of domain names that can be used in attacks, making it difficult for traditional detection methods to keep up.

2. Volume of Data¶

The sheer volume of web traffic makes manual detection impractical. Millions of new URLs are generated daily, requiring automated systems to effectively analyze and identify malicious ones.

3. Sophistication of Malicious URLs¶

Malicious URLs may closely resemble benign ones, making it challenging to differentiate between safe and harmful links. This includes:

  • Typosquatting: Attackers create URLs that are similar to legitimate ones but contain slight misspellings, tricking users into visiting harmful sites.
  • Phishing Pages: Phishing URLs often use similar branding or design as legitimate sites, making detection difficult.

4. False Positives and Negatives¶

Balancing detection accuracy is crucial:

  • False Positives: Legitimate URLs may be flagged as malicious, causing inconvenience for users and potentially damaging trust.
  • False Negatives: Malicious URLs that go undetected can lead to significant harm, such as data breaches or financial losses.

  • Given the potential consequences, we are particularly vigilant about minimizing false negatives. Allowing a malicious URL to evade detection poses a far greater risk than mistakenly flagging a benign URL. Thus, ensuring accurate detection of malicious URLs is paramount to safeguarding user safety and maintaining trust in our cybersecurity measures.

5. Dynamic and Contextual Nature of URLs¶

The context in which a URL is used can affect its maliciousness. URLs may appear benign in one scenario but could be harmful in another, requiring systems to analyze contextual data effectively.

Problem Statement¶

In this case study, we address the detection of malicious URLs as a multi-class classification problem. Our objective is to classify raw URLs into different categories, including:

  • Benign or Safe URLs: Legitimate web addresses that pose no threat to users.
  • Phishing URLs: URLs designed to deceive users into providing sensitive information by mimicking legitimate sites.
  • Malware URLs: URLs that deliver harmful software to users' devices.
  • Defacement URLs: URLs that lead to altered web pages with unauthorized content.

By accurately classifying these URL types, we aim to enhance cybersecurity measures and provide users with better protection against various online threats.

Project Workflow¶

1. Data Preprocessing¶

  • Load Data: Import the dataset and inspect its structure.
  • Check for Null and Duplicate Values: Identify and remove any missing or duplicated data to ensure dataset quality.

2. Feature Extraction¶

  • Extract key features from the URLs, such as:
    • Length of the URL.
    • Number of special characters.
    • Presence of suspicious keywords or domains.
    • Tokenizing the URL and creating n-grams.

3. Outlier Handling¶

  • Identify and remove extreme outliers using statistical methods (Z-scores or IQR).

4. Train-Test Split¶

  • Split the dataset into training and testing sets (80/20 split).

5. Exploratory Data Analysis (EDA)¶

  • Target Distribution: Visualize the distribution of the target class (malicious vs non-malicious).
  • Feature Distribution: Analyze the distributions of key features in the training data.

6. Feature Engineering¶

  • WordCloud for Feature Extraction: Generate WordClouds to extract new features from URLs based on frequently occurring words/phrases.
  • EDA for Features: Perform detailed analysis on the extracted features.

7. Feature Selection¶

  • Statistical Tests:
    • Use f_classif to assess the importance of numerical features.
    • Apply Chi-Square tests for binary categorical features.
  • Feature Selection Techniques:
    • Use Mutual Information for selecting informative features.
    • Apply BorutaPy for selecting features based on their relevance.

8. Model Building¶

  • Train 3 Types of Models:
    • XGBoost (XGB), LightGBM (LGBM), and CatBoost.
    • For each model, use 3 different sets of features:
      1. All Features.
      2. Mutual Information (MI) Selected Features.
      3. BorutaPy Selected Features.
  • Optimization: Use Optuna for hyperparameter optimization across all models, resulting in a total of 9 model configurations.

9. Model Evaluation¶

  • Compare the performance of all 9 models using metrics such as accuracy, balanced accuracy, precision, recall and F1 score.
  • Identify the best-performing models based on evaluation metrics.

10. Model Interpretability¶

  • Use SHAP (SHapley Additive exPlanations) to interpret the feature importance and explain the predictions of the best-performing models.

11. Deep Learning Models¶

  • Artificial Neural Networks (ANN):
    • Doing manual hyperpamaters
    • Train an FNN model for URL classification.
    • Model evaluation.
  • BERT:
    • Use BERT (Bidirectional Encoder Representations from Transformers) to classify URLs based on text embeddings.
    • Train FNN model.
    • Model evaluation.

Dataset Description¶

In this case study, we will be using a Malicious URLs dataset consisting of 651,191 URLs, categorized as follows:

  • 428,103 Benign or Safe URLs
  • 96,457 Defacement URLs
  • 94,111 Phishing URLs
  • 32,520 Malware URLs

Now, let’s discuss the different types of URLs in our dataset: Benign, Malware, Phishing, and Defacement URLs.

Benign URLs¶

These are safe to browse URLs. Some examples of benign URLs include:

  • mp3raid.com/music/krizz_kaliko.html
  • infinitysw.com
  • google.co.in
  • myspace.com

Malware URLs¶

These types of URLs inject malware into the victim’s system once they visit such URLs. Some examples of malware URLs include:

  • proplast.co.nz
  • http://103.112.226.142:36308/Mozi.m
  • microencapsulation.readmyweather.com
  • xo3fhvm5lcvzy92q.download

Defacement URLs¶

Defacement URLs are typically created by hackers with the intention of breaking into a web server and replacing the hosted website with one of their own, using techniques such as code injection or cross-site scripting. Common targets of defacement URLs include religious, government, bank, and corporate websites. Some examples of defacement URLs include:

  • http://www.vnic.co/khach-hang.html
  • http://www.raci.it/component/user/reset.html
  • http://www.approvi.com.br/ck.htm
  • http://www.juventudelirica.com.br/index.html

Phishing URLs¶

Phishing URLs are created by hackers to steal sensitive personal or financial information such as login credentials, credit card numbers, and internet banking details. Some examples of phishing URLs are:

  • roverslands.net
  • corporacionrossenditotours.com
  • http://drive-google-com.fanalav.com/6a7ec96d6a
  • citiprepaid-salarysea-at.tk

Importing Libraries¶

In [3]:
import pandas as pd
import numpy  as np
import scipy.stats as stats
import math
import time

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.ticker as ticker

import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
%matplotlib inline


from typing import Optional, Callable, Union, Any, Tuple, List
import re

from urllib.parse import urlparse
import tldextract

from sklearn.utils import shuffle, compute_sample_weight
from sklearn.feature_selection import mutual_info_classif, f_classif, SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import xgboost as xgb
import lightgbm as lgb
import catboost as cat
from sklearn.ensemble import RandomForestClassifier
import optuna

import time

from boruta import BorutaPy
from wordcloud import WordCloud
import shap

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', message="No further splits with positive gain")

Functions¶

In [4]:
def bar_plot(data: pd.DataFrame, x: str, y: str, hue: Optional[str] 

             ,title = Optional[str], xlabel = Optional[str], ylabel = Optional[str]) ->None:
    
    plt.figure(figsize=(10, 6))
    plt.title(title)
    sns.barplot(x=x, y=y, hue=hue, data=data,legend=True)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)  
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()    


def plot_corr_matrix(df) -> None:
    '''Plots formatted correlation matrix for the supplied df.'''
    fig = plt.figure()
    fig.set_size_inches(10,8)
    ax = fig.add_subplot()

    corr_mat = df.corr()
    mask = np.triu(corr_mat)
    cmap = sns.diverging_palette(220, 20, as_cmap=True)
    sns.heatmap(corr_mat, square=True, mask=mask, cmap=cmap, 
                annot=True, fmt='.2f', vmin=-1, vmax=1,
                annot_kws={'fontsize':'small'},
                ax=ax);
    ax.set_title('Pearson correlation heatmap');

def count_https(url: str) -> int:
    return url.count('https')
    
def count_http(url: str) -> int:
    return url.count('http')

def having_ip_address(url: str) -> int:
    pattern = (
        r'(([01]?\d\d?|2[0-4]\d|25[0-5])\.)'   # First part of IPv4
        r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'     # Second part of IPv4
        r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'     # Third part of IPv4
        r'([01]?\d\d?|2[0-4]\d|25[0-5])|'      # Fourth part of IPv4
        r'((?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4})'  # IPv6
    )
    match = re.search(pattern, url)    
    if match:
        return 1
    else:
        return 0

        
def abnormal_url(url: str) -> int:
    hostname = urlparse(url).hostname
    if hostname and re.search(re.escape(hostname), url):
        return 1
    return 0
    
def has_subdomain(url: str) -> int:
    extracted_info = tldextract.extract(url)
    return int( bool(extracted_info.subdomain))

def extract_tld(url):
    extracted_info = tldextract.extract(url)
    return extracted_info.suffix
    
def is_risky_tld(url: str) -> int:
    tld = extract_tld(url)
    risky_tlds = {
        'ru', 'cn', 'tk', 'ml', 'ga', 'cf', 'gq', 'work', 'xyz', 'top',
        'club', 'men', 'biz', 'info', 'pw', 'cc', 'in', 'us', 'eu', 'co'
    }
    return tld.lower() in risky_tlds

def is_suspicious_suffix(url: str) -> int:
    suspicious_file_extensions = ('.exe', '.php', '.js', '.zip', '.cgi', '.asp', '.aspx')
    return int( any(url.lower().endswith(ext) for ext in suspicious_file_extensions) )



def has_shortening_service(url: str) -> int:
    pattern = re.compile(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                         r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                         r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                         r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                         r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                         r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                         r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                         r'tr\.im|link\.zip\.net', re.IGNORECASE)
    match = pattern.search(url)
    return int(bool(match))


def contains_suspicious_word(url: str) -> int:
    pattern = re.compile(r'PayPal|login|signin|bank|account|update|free|lucky|service|bonus|ebayisapi|webscr', re.IGNORECASE)
    match = pattern.search(url)
    return 1 if match else 0

    
def longest_digit_sequence(url: str) -> int:
    return max(map(len, re.findall(r'\d+', url)), default=0)

def contains_non_ascii(url: str) -> int:
    return int( any(ord(char) > 127 for char in url) )

def has_port_number(url: str) -> int:
    parsed_url = urlparse(url)
    return int( bool(parsed_url.port) )


def count_alpha(url: str) -> int:
    alpha = 0
    for i in url:
        if i.isalpha():
            alpha += 1
    return alpha

def count_digits(url: str) -> int:
    digits = 0
    for i in url:
        if i.isnumeric():
            digits += 1
    return digits

def count_hexadecimal_chars(url: str) -> int:
    return sum(1 for match in re.findall(r'%[0-9A-Fa-f]{2}', url))


def count_dot(url: str) -> int:
    return url.count('.')

def count_www(url: str) -> int:
    return url.count('www')

def count_atrate(url: str) -> int:
    return url.count('@')

def count_per(url: str) -> int:
    return url.count('%')

def count_ques(url: str) -> int:
    return url.count('?')

def count_hyphen(url: str) -> int:
    return url.count('-')

def count_equal(url: str) -> int:
    return url.count('=')

def count_slashes(url: str) -> int:
    return url.count('/')

def count_double_slashes(url: str) -> int:
    return url.count('//')

def sum_special_chars(url: str) -> int:
    special_chars = "!#$%&()*+,/:;<=>?@[\\]^_`{|}~"
    return sum(1 for char in url if char in special_chars)
    
def count_parameters(url: str) -> int:
    parsed_url = urlparse(url)
    return len(parsed_url.query.split('&')) if parsed_url.query else 0

def count_repeated_char(url: str) -> int:
    return max([url.count(char) for char in set(url)])

def count_subdomains(url: str) -> int:
    extracted_info = tldextract.extract(url)
    subdomain = extracted_info.subdomain
    return len(subdomain.split('.')) if subdomain else 0

def number_of_directories(url: str) -> int:
    urldir = urlparse(url).path
    return urldir.count('/')

def number_of_embedded(url: str) -> int:
    urldir = urlparse(url).path
    return urldir.count('//')
  
def get_url_length(url: str) -> int:
    return len(url)

def get_domain_length(url: str) -> int:
    parsed_url = urlparse(url)
    domain = parsed_url.netloc or parsed_url.path.split('/')[0]
    return len(domain.split(':')[0])

def get_path_length(url: str) -> int:
    urlpath = urlparse(url).path
    path_segments = urlpath.strip('/').split('/')
    
    if path_segments:
        return len(path_segments[0])
    else:
        return 0
    
def first_directory_length(url: str) -> int:
    urlpath= urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except:
        return 0

def check_accuracy(true_labels, predicted_labels, metric: str) -> None:
    conf_matrix = confusion_matrix(true_labels, predicted_labels)
    accuracy = accuracy_score(true_labels, predicted_labels)
    balanced_accuracy = balanced_accuracy_score(true_labels, predicted_labels)
    precision = precision_score(true_labels, predicted_labels, average='macro')
    recall = recall_score(true_labels, predicted_labels, average='macro')
    f1 = f1_score(true_labels, predicted_labels, average='macro')
    
    metrics_text = (f"Accuracy: {accuracy:.2f}\n"
                    f"Balanced Accuracy: {balanced_accuracy:.2f}\n"
                    f"Precision: {precision:.2f}\n"
                    f"Recall: {recall:.2f}\n"
                    f"F1-Score: {f1:.2f}")

    fig, ax = plt.subplots(1, 2, figsize=(12, 6)) 
    
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=True,
                yticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'],
                xticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'], ax=ax[0])
    ax[0].set_title(f'Confusion Matrix for {metric} Model (Counts)')
    ax[0].set_xlabel('Predicted Labels')
    ax[0].set_ylabel('True Labels')

    conf_matrix_normalized = conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis]
    
    sns.heatmap(conf_matrix_normalized, annot=True, fmt='.2%', cmap='Blues', cbar=True,
                yticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'],
                xticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'], ax=ax[1])
    ax[1].set_title(f'Confusion Matrix for {metric} Model (Percentage)')
    ax[1].set_xlabel('Predicted Labels')
    ax[1].set_ylabel('True Labels')
    
    fig.text(0.5, -0.05, metrics_text, ha='center', va='center', fontsize=12)
    
    plt.tight_layout()
    plt.show()

Config¶

In [5]:
DIRECTORY_PATH = 'C:\\Users\\גיא\\OneDrive\\שולחן העבודה\\סדנה במדעי הנתונים\\malicious'
FILE_NAME = 'malicious_phish.csv'
TARGET = 'type'
SPLIT = 0.8
RANDOM_STATE = 2024
np.random.seed(RANDOM_STATE)

color_mapping = {
    'benign': '#66c2a5',  
    'phishing': '#fc8d62',
    'defacement': '#ffd92f',
    'malware': '#b3b3b3',
}

labelsize = 8
fontsize = 10

categorical_columns = [ 'having_ip_address' ,'is_abnormal_url' ,'has_subdomain' ,'is_risky_tld',
'is_suspicious_suffix' ,'has_shortening_service' ,'has_port_number', 'contains_non_ascii', 'contains_suspicious_word']

numerical_columns = ['longest_digit_sequence', 'count_https', 'count_http', 'count_alpha', 'count_digits', 'count_hex_char',
                     'count_dot', 'count_@', 'count_%', 'count_?', 'count_-', 'count_=', 'count_/',
                     'count_//', 'sum_special_chars', 'count_parameters', 'count_repeated_char',
                     'count_subdomain', 'number_of_directories', 'number_of_embedded', 'url_length',
                     'domain_length', 'path_length', 'tld_length', 'first_directory_length',
                     'alpha_char_ratio', 'digit_char_ratio', 'special_char_ratio']

Loading Dataseta & Data description¶

In [6]:
raw_data = pd.read_csv(DIRECTORY_PATH+'\\'+FILE_NAME)
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651191 entries, 0 to 651190
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     651191 non-null  object
 1   type    651191 non-null  object
dtypes: object(2)
memory usage: 9.9+ MB
In [7]:
raw_data.isna().sum() 
Out[7]:
url     0
type    0
dtype: int64
In [8]:
total_clients  = len(raw_data.url)
unique_clients = len(raw_data.url.unique())

plt.figure(figsize=(8, 8))
plt.pie([unique_clients, total_clients - unique_clients], labels=['Unique Clients', 'Duplicate Clients'], 
        autopct='%1.1f%%', colors=['#ff9999','#66b3ff'], startangle=140, textprops={'fontsize': 14})
plt.title('Proportion of Unique vs. Duplicate Urls', fontsize=16)
plt.show()

# Remove duplicated url
df = raw_data[~raw_data.url.duplicated()].copy()
No description has been provided for this image

Feature Engineering: Extracting Features from URLs¶

Feature engineering is a crucial step in building models for URL classification. By extracting relevant features from URLs, we can enhance the model's ability to differentiate between benign and malicious URLs. Various types of features can be extracted, including lexical features, structural features, and semantic features.

Types of Features¶

  1. Lexical Features: These features relate to the composition of the URL string itself. They include character counts, ratios, and the presence of specific characters or patterns.

    • count_http: Counts the occurrences of the http protocol in the URL. A higher count may indicate a preference for non-secure connections.
    • count_https: Counts the occurrences of the https protocol in the URL. More secure URLs may have a higher count, but attackers can also use HTTPS to disguise malicious intent.
    • count_alpha: Total number of alphabetic characters in the URL. Higher counts might indicate more complex URLs, which are often used in phishing.
    • count_digits: Total number of numeric characters in the URL. Attackers may use numbers to obscure the URL's intent.
    • count_hex_char: Count of hexadecimal characters present in the URL. This can indicate obfuscation techniques.
    • count_dot: Number of dots (.) in the URL. Attackers may use multiple dots to create subdomains or mislead users.
    • count_@: Count of the @ symbol in the URL, often used in email addresses or to obscure the true domain.
    • count_%: Number of percent signs (%) in the URL, commonly used in encoding, which can signal attempts to disguise content.
    • count_?: Count of question marks (?), indicating query parameters that can carry additional malicious payloads.
    • count_-: Count of hyphens (-), often used in deceptive URL formations.
    • count_=: Number of equal signs (=) in the URL, often seen in query strings.
    • count_/: Count of slashes (/), which can indicate nested paths or complexity in URL structure.
    • count_//: Count of double slashes (//), often indicating the start of a resource in the URL.
    • sum_special_chars: Total count of all special characters in the URL. A higher number can indicate an attempt to obfuscate the URL.
    • count_parameters: Number of parameters in the URL. URLs with many parameters may be more likely to contain malicious content.
    • count_repeated_char: Count of characters that are repeated consecutively, which can be a tactic used in phishing attempts.
    • count_subdomain: Count of subdomains in the URL. Malicious URLs often use multiple subdomains to confuse users.
  2. Structural Features: These features describe the structure and components of the URL.

    • having_ip_address: Checks whether the URL contains an IP address instead of a domain name. Cyber attackers often use IP addresses to hide their identity.
    • is_abnormal_url: Identifies URLs that exhibit characteristics that deviate from typical patterns, potentially indicating malicious intent.
    • has_subdomain: Indicates the presence of a subdomain in the URL. Malicious URLs often utilize subdomains to mislead users.
    • is_risky_tld: Flags if the URL has a top-level domain associated with higher risks, such as .xyz or .info.
    • is_suspicious_suffix: Flags URLs with suffixes that are commonly used in phishing or malicious URLs.
    • has_shortening_service: Indicates if the URL uses a URL shortening service (bit.ly), which can obscure the true destination.
    • longest_digit_sequence: The length of the longest sequence of digits found in the URL. Longer sequences may indicate attempts to obfuscate.
    • contains_non_ascii: Flags if the URL contains non-ASCII characters, which can be used in obfuscation tactics.
    • has_port_number: Indicates if the URL specifies a port number, which may be uncommon for benign URLs.
  3. Semantic Features: These features capture the meaning behind certain components of the URL.

    • url_length: Total length of the URL. Attackers often use longer URLs to hide the domain name and mislead users.
    • domain_length: Length of the domain part of the URL. Short or overly complex domains can signal potential threats.
    • path_length: Length of the path in the URL. Longer paths can indicate attempts to confuse users.
    • tld_length: Length of the top-level domain. Longer TLDs may be used to disguise malicious intent.
    • first_directory_length: Length of the first directory in the URL path. Short or suspicious first directories can indicate potential threats.
    • contains_suspicious_word: Flags the presence of suspicious words in the URL, which are often associated with phishing.
    • alpha_char_ratio: Ratio of alphabetic characters to total characters, providing insight into the URL's composition.
    • digit_char_ratio: Ratio of numeric characters to total characters, highlighting potential obfuscation.
    • special_char_ratio: Ratio of special characters to total characters, indicating complexity and potential risk.
In [9]:
df['having_ip_address'] = df['url'].apply(lambda i: having_ip_address(i))

df['is_abnormal_url']  = df['url'].apply(lambda x: abnormal_url(x))
df['has_subdomain'] = df['url'].apply(has_subdomain)

df['is_risky_tld']           = df['url'].apply(lambda x: is_risky_tld(x))
df['is_suspicious_suffix']   = df['url'].apply(lambda x: is_suspicious_suffix(x))
df['has_shortening_service'] = df['url'].apply(lambda x: has_shortening_service(x))

df['longest_digit_sequence'] = df['url'].apply(lambda x: longest_digit_sequence(x))
df['contains_non_ascii']     = df['url'].apply(lambda x: contains_non_ascii(x))
df['has_port_number']        = df['url'].apply(lambda x: has_port_number(x))

df['count_http']     = df['url'].apply(lambda x: count_http(x))
df['count_https']    = df['url'].apply(lambda x: count_https(x))
df['count_alpha']    = df['url'].apply(lambda x: count_alpha(x))
df['count_digits']   = df['url'].apply(lambda x: count_digits(x))
df['count_hex_char'] = df['url'].apply(lambda x: count_hexadecimal_chars(x))
df['count_dot']      = df['url'].apply(lambda x: count_dot(x))
df['count_@']        = df['url'].apply(lambda x: count_atrate(x))
df['count_%']        = df['url'].apply(lambda x: count_per(x))
df['count_?']        = df['url'].apply(lambda x: count_ques(x))
df['count_-']        = df['url'].apply(lambda x: count_hyphen(x))
df['count_=']        = df['url'].apply(lambda x: count_equal(x))
df['count_/']        = df['url'].apply(lambda x: count_slashes(x))
df['count_//']       = df['url'].apply(lambda x: count_double_slashes(x))

df["sum_special_chars"]   = df['url'].apply(lambda x: sum_special_chars(x))
df['count_parameters']    = df['url'].apply(lambda x: count_parameters(x))
df['count_repeated_char'] = df['url'].apply(lambda x: count_repeated_char(x))
df['count_subdomain']     = df['url'].apply(lambda x: count_subdomains(x))

df['number_of_directories'] = df['url'].apply(lambda x: number_of_directories(x))
df['number_of_embedded']    = df['url'].apply(lambda x: number_of_embedded(x))

df['url_length']          = df['url'].apply(get_url_length)
df["domain_length"]       = df["url"].apply(get_domain_length)
df['path_length']         = df['url'].apply(get_path_length)
df['tld_length']          = df['url'].apply(lambda x: len(extract_tld(x)))


df['first_directory_length']   = df['url'].apply(lambda x: first_directory_length(x))
df['contains_suspicious_word'] = df['url'].apply(lambda x: contains_suspicious_word(x))

df['alpha_char_ratio']      = df["count_alpha"]       / df["url_length"] 
df['digit_char_ratio']      = df["count_digits"]      / df["url_length"] 
df['special_char_ratio']    = df["sum_special_chars"] / df["url_length"] 


df.set_index('url', inplace=True)    
df.shape
Out[9]:
(641119, 38)
In [10]:
df.tail(10)
Out[10]:
type having_ip_address is_abnormal_url has_subdomain is_risky_tld is_suspicious_suffix has_shortening_service longest_digit_sequence contains_non_ascii has_port_number ... number_of_embedded url_length domain_length path_length tld_length first_directory_length contains_suspicious_word alpha_char_ratio digit_char_ratio special_char_ratio
url
www.1up.com/do/gameOverview?cId=3159391 phishing 0 0 1 False 0 0 7 0 0 ... 0 39 11 11 3 2 0 0.641026 0.205128 0.102564
psx.ign.com/articles/131/131835p1.html phishing 0 0 1 False 0 0 6 0 0 ... 0 38 11 11 3 8 0 0.578947 0.263158 0.078947
wii.gamespy.com/wii/cursed-mountain/ phishing 0 0 1 False 0 0 0 0 0 ... 0 36 15 15 3 3 0 0.833333 0.000000 0.083333
wii.ign.com/objects/142/14270799.html phishing 0 0 1 False 0 0 8 0 0 ... 0 37 11 11 3 7 0 0.540541 0.297297 0.081081
xbox360.gamespy.com/xbox-360/dead-space/ phishing 0 0 1 False 0 0 3 0 0 ... 0 40 19 19 3 8 0 0.675000 0.150000 0.075000
xbox360.ign.com/objects/850/850402.html phishing 0 0 1 False 0 0 6 0 0 ... 0 39 15 15 3 7 0 0.538462 0.307692 0.076923
games.teamxbox.com/xbox-360/1860/Dead-Space/ phishing 0 0 1 False 0 1 4 0 0 ... 0 44 18 18 3 8 0 0.659091 0.159091 0.090909
www.gamespot.com/xbox360/action/deadspace/ phishing 0 0 1 False 0 1 3 0 0 ... 0 42 16 16 3 7 0 0.785714 0.071429 0.095238
en.wikipedia.org/wiki/Dead_Space_(video_game) phishing 0 0 1 False 0 0 0 0 0 ... 0 45 16 16 3 4 0 0.800000 0.000000 0.155556
www.angelfire.com/goth/devilmaycrytonite/ phishing 0 0 1 False 0 0 0 0 0 ... 0 41 17 17 3 4 0 0.878049 0.000000 0.073171

10 rows × 38 columns

In [11]:
df[df.select_dtypes(include=[bool]).columns] = df.select_dtypes(include=[bool]).astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 641119 entries, br-icloud.com.br to www.angelfire.com/goth/devilmaycrytonite/
Data columns (total 38 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   type                      641119 non-null  object 
 1   having_ip_address         641119 non-null  int64  
 2   is_abnormal_url           641119 non-null  int64  
 3   has_subdomain             641119 non-null  int64  
 4   is_risky_tld              641119 non-null  int32  
 5   is_suspicious_suffix      641119 non-null  int64  
 6   has_shortening_service    641119 non-null  int64  
 7   longest_digit_sequence    641119 non-null  int64  
 8   contains_non_ascii        641119 non-null  int64  
 9   has_port_number           641119 non-null  int64  
 10  count_http                641119 non-null  int64  
 11  count_https               641119 non-null  int64  
 12  count_alpha               641119 non-null  int64  
 13  count_digits              641119 non-null  int64  
 14  count_hex_char            641119 non-null  int64  
 15  count_dot                 641119 non-null  int64  
 16  count_@                   641119 non-null  int64  
 17  count_%                   641119 non-null  int64  
 18  count_?                   641119 non-null  int64  
 19  count_-                   641119 non-null  int64  
 20  count_=                   641119 non-null  int64  
 21  count_/                   641119 non-null  int64  
 22  count_//                  641119 non-null  int64  
 23  sum_special_chars         641119 non-null  int64  
 24  count_parameters          641119 non-null  int64  
 25  count_repeated_char       641119 non-null  int64  
 26  count_subdomain           641119 non-null  int64  
 27  number_of_directories     641119 non-null  int64  
 28  number_of_embedded        641119 non-null  int64  
 29  url_length                641119 non-null  int64  
 30  domain_length             641119 non-null  int64  
 31  path_length               641119 non-null  int64  
 32  tld_length                641119 non-null  int64  
 33  first_directory_length    641119 non-null  int64  
 34  contains_suspicious_word  641119 non-null  int64  
 35  alpha_char_ratio          641119 non-null  float64
 36  digit_char_ratio          641119 non-null  float64
 37  special_char_ratio        641119 non-null  float64
dtypes: float64(3), int32(1), int64(33), object(1)
memory usage: 188.3+ MB
In [12]:
df.select_dtypes(include=['O']).columns 
Out[12]:
Index(['type'], dtype='object')
In [13]:
df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
having_ip_address 641119.0 0.019461 0.138140 0.0 0.000000 0.000000 0.000000 1.000000
is_abnormal_url 641119.0 0.277557 0.447794 0.0 0.000000 0.000000 1.000000 1.000000
has_subdomain 641119.0 0.390959 0.487966 0.0 0.000000 0.000000 1.000000 1.000000
is_risky_tld 641119.0 0.032602 0.177594 0.0 0.000000 0.000000 0.000000 1.000000
is_suspicious_suffix 641119.0 0.045665 0.208759 0.0 0.000000 0.000000 0.000000 1.000000
has_shortening_service 641119.0 0.061483 0.240215 0.0 0.000000 0.000000 0.000000 1.000000
longest_digit_sequence 641119.0 2.452490 3.280998 0.0 0.000000 1.000000 4.000000 133.000000
contains_non_ascii 641119.0 0.001427 0.037751 0.0 0.000000 0.000000 0.000000 1.000000
has_port_number 641119.0 0.007722 0.087537 0.0 0.000000 0.000000 0.000000 1.000000
count_http 641119.0 0.285666 0.466025 0.0 0.000000 0.000000 1.000000 9.000000
count_https 641119.0 0.025992 0.162631 0.0 0.000000 0.000000 0.000000 5.000000
count_alpha 641119.0 45.179165 31.735030 0.0 25.000000 37.000000 58.000000 2141.000000
count_digits 641119.0 5.371986 11.630365 0.0 0.000000 2.000000 6.000000 1204.000000
count_hex_char 641119.0 0.397326 4.165907 0.0 0.000000 0.000000 0.000000 231.000000
count_dot 641119.0 2.193950 1.491449 0.0 1.000000 2.000000 3.000000 42.000000
count_@ 641119.0 0.002243 0.054507 0.0 0.000000 0.000000 0.000000 10.000000
count_% 641119.0 0.398489 4.166377 0.0 0.000000 0.000000 0.000000 231.000000
count_? 641119.0 0.221391 0.440003 0.0 0.000000 0.000000 0.000000 20.000000
count_- 641119.0 1.561364 2.984744 0.0 0.000000 0.000000 2.000000 87.000000
count_= 641119.0 0.591642 1.491306 0.0 0.000000 0.000000 0.000000 51.000000
count_/ 641119.0 2.921902 1.895781 0.0 2.000000 3.000000 4.000000 41.000000
count_// 641119.0 0.281310 0.456609 0.0 0.000000 0.000000 1.000000 9.000000
sum_special_chars 641119.0 5.436424 6.528653 0.0 2.000000 4.000000 6.000000 367.000000
count_parameters 641119.0 0.578999 1.471207 0.0 0.000000 0.000000 0.000000 51.000000
count_repeated_char 641119.0 6.565156 5.371619 1.0 4.000000 5.000000 8.000000 588.000000
count_subdomain 641119.0 0.496700 1.007248 0.0 0.000000 0.000000 1.000000 33.000000
number_of_directories 641119.0 2.310321 1.566776 0.0 1.000000 2.000000 3.000000 39.000000
number_of_embedded 641119.0 0.001529 0.039543 0.0 0.000000 0.000000 0.000000 2.000000
url_length 641119.0 59.762470 44.894590 1.0 32.000000 47.000000 76.000000 2175.000000
domain_length 641119.0 17.403839 11.360269 0.0 12.000000 16.000000 20.000000 248.000000
path_length 641119.0 15.336243 12.896271 0.0 9.000000 13.000000 18.000000 304.000000
tld_length 641119.0 2.986825 0.904940 0.0 3.000000 3.000000 3.000000 18.000000
first_directory_length 641119.0 8.527999 11.064798 0.0 4.000000 6.000000 9.000000 408.000000
contains_suspicious_word 641119.0 0.076884 0.266408 0.0 0.000000 0.000000 0.000000 1.000000
alpha_char_ratio 641119.0 0.777544 0.116276 0.0 0.735294 0.800000 0.857143 1.000000
digit_char_ratio 641119.0 0.070980 0.100210 0.0 0.000000 0.031250 0.104651 1.000000
special_char_ratio 641119.0 0.084431 0.045670 0.0 0.052632 0.078947 0.111111 0.535714

Outlier Detection:¶

We observe some extremely out-of-range values as outliers, such as in the features url_length, path_length, domain_length, count_repeated_char, and first_directory_length.

For example, the url_length feature exhibited significant extreme values, as indicated by the descriptive statistics:

  • 25th Percentile: 32
  • 75th Percentile: 76
  • Maximum Length: 2175

The maximum value of 2175 is notably high compared to the interquartile range, suggesting the presence of extreme outliers. By applying a filter to remove values exceeding the 95th percentile for url_length, we aim to mitigate the influence of these extreme cases on our analysis and model performance. This step is crucial for enhancing the robustness of our results while ensuring that the majority of meaningful observations are retained.

Moreover, removing outliers from the **`url_lengthshouldure may also help address the remaining outlier features, as they could be correlated and influenced by similar patterns in the data. retained. tained. .

In [14]:
extreme_values = df[ df['url_length'] < df['url_length'].quantile(.95)]

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 7))

df_labels       = df['type'].value_counts()
extreme_labels  = extreme_values['type'].value_counts()

ax[0].pie(df_labels, labels=df_labels.index, autopct='%1.1f%%', startangle=90, 
                 textprops={'fontsize': labelsize}, colors=[color_mapping[label] for label in df_labels.index])
ax[0].set_title(f'Orignal Distribution', fontsize=fontsize, y=-0.1)

ax[1].pie(extreme_labels, labels=extreme_labels.index, autopct='%1.1f%%', startangle=90, 
                 textprops={'fontsize': labelsize}, colors=[color_mapping[label] for label in extreme_labels.index])
ax[1].set_title(f'Extreme Distribution', fontsize=fontsize, y=-0.1)    

fig.suptitle(f'\n URL Types Distribution', fontsize=18, y=1.02)

plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show()
No description has been provided for this image

Outlier Handling:¶

During the analysis, it was observed that the url_length feature contained extreme outliers that could distort the results. To address this, a threshold was applied to remove the outliers by filtering out URLs with a length greater than the 95th percentile. This step ensures cleaner data without distorting the target distribution, as the distribution of the target variable remained consistent before and after the removal of outliers. Thus, removing these outliers does not negatively affect the overall target distribution or the model's ability to generalize.

In [15]:
df = df[ df['url_length'] < df['url_length'].quantile(.95)]
In [16]:
df.describe().T
Out[16]:
count mean std min 25% 50% 75% max
having_ip_address 608510.0 0.020292 0.140998 0.0 0.000000 0.000000 0.000000 1.000000
is_abnormal_url 608510.0 0.258231 0.437662 0.0 0.000000 0.000000 1.000000 1.000000
has_subdomain 608510.0 0.382896 0.486094 0.0 0.000000 0.000000 1.000000 1.000000
is_risky_tld 608510.0 0.032014 0.176038 0.0 0.000000 0.000000 0.000000 1.000000
is_suspicious_suffix 608510.0 0.046717 0.211033 0.0 0.000000 0.000000 0.000000 1.000000
has_shortening_service 608510.0 0.061628 0.240478 0.0 0.000000 0.000000 0.000000 1.000000
longest_digit_sequence 608510.0 2.304192 3.078039 0.0 0.000000 1.000000 4.000000 81.000000
contains_non_ascii 608510.0 0.001173 0.034234 0.0 0.000000 0.000000 0.000000 1.000000
has_port_number 608510.0 0.008113 0.089708 0.0 0.000000 0.000000 0.000000 1.000000
count_http 608510.0 0.263940 0.449999 0.0 0.000000 0.000000 1.000000 4.000000
count_https 608510.0 0.023904 0.154548 0.0 0.000000 0.000000 0.000000 3.000000
count_alpha 608510.0 40.403134 20.975018 0.0 25.000000 35.000000 53.000000 125.000000
count_digits 608510.0 3.870893 5.847641 0.0 0.000000 1.000000 6.000000 89.000000
count_hex_char 608510.0 0.141838 1.453663 0.0 0.000000 0.000000 0.000000 37.000000
count_dot 608510.0 2.085859 1.161088 0.0 1.000000 2.000000 3.000000 28.000000
count_@ 608510.0 0.001755 0.044076 0.0 0.000000 0.000000 0.000000 6.000000
count_% 608510.0 0.142676 1.454720 0.0 0.000000 0.000000 0.000000 37.000000
count_? 608510.0 0.194409 0.418829 0.0 0.000000 0.000000 0.000000 20.000000
count_- 608510.0 1.419641 2.711581 0.0 0.000000 0.000000 1.000000 41.000000
count_= 608510.0 0.454431 1.195632 0.0 0.000000 0.000000 0.000000 17.000000
count_/ 608510.0 2.845536 1.832456 0.0 1.000000 3.000000 4.000000 28.000000
count_// 608510.0 0.261761 0.445712 0.0 0.000000 0.000000 1.000000 3.000000
sum_special_chars 608510.0 4.699941 4.180913 0.0 2.000000 4.000000 6.000000 42.000000
count_parameters 608510.0 0.450476 1.184976 0.0 0.000000 0.000000 0.000000 17.000000
count_repeated_char 608510.0 5.895251 2.971220 1.0 4.000000 5.000000 7.000000 70.000000
count_subdomain 608510.0 0.444137 0.709705 0.0 0.000000 0.000000 1.000000 26.000000
number_of_directories 608510.0 2.276585 1.508885 0.0 1.000000 2.000000 3.000000 28.000000
number_of_embedded 608510.0 0.001548 0.039772 0.0 0.000000 0.000000 0.000000 2.000000
url_length 608510.0 52.489627 27.845507 1.0 31.000000 45.000000 70.000000 133.000000
domain_length 608510.0 16.689652 7.098847 0.0 12.000000 16.000000 20.000000 132.000000
path_length 608510.0 14.731087 8.562510 0.0 9.000000 14.000000 18.000000 132.000000
tld_length 608510.0 2.980912 0.892058 0.0 3.000000 3.000000 3.000000 18.000000
first_directory_length 608510.0 8.239087 9.639157 0.0 4.000000 6.000000 9.000000 125.000000
contains_suspicious_word 608510.0 0.064053 0.244848 0.0 0.000000 0.000000 0.000000 1.000000
alpha_char_ratio 608510.0 0.781951 0.113515 0.0 0.739130 0.805556 0.859155 1.000000
digit_char_ratio 608510.0 0.066333 0.096428 0.0 0.000000 0.025000 0.098039 1.000000
special_char_ratio 608510.0 0.083615 0.043625 0.0 0.052632 0.078947 0.111111 0.535714

Post-Outlier Removal Analysis:¶

After removing URLs with a url_length greater than the 95th percentile, the descriptive statistics show a much more reasonable range. This adjustment has minimized extreme values, resulting in a dataset that better reflects typical URL characteristics and enhances the reliability of our analysis and model performance.

In [17]:
original_count = 641119.0
new_count = 608510.0
percentage_loss = np.round((original_count - new_count) / original_count * 100, 2)

print(f"By removing the outliers, we lost {percentage_loss}% of the data.")
By removing the outliers, we lost 5.09% of the data.

Data Splitting: Train and Test¶

In this project, we are working on a machine learning model. The data is split into two sets:

  • Training Set (80%): This set will be used to train the model. The model learns from this data by identifying patterns and relationships between the input features and the target variable.

  • Test Set (20%): This set is reserved to evaluate the performance of the trained model. By testing on unseen data, we can check the model's generalization ability, ensuring that it works well not only on the training data but also on new, unseen data.

The 80/20 split is a common practice, balancing the amount of data available for training while keeping enough data to validate the model's performance effectively.

Model Performance Assessment¶

After training the model on the training set, we will assess its performance on the test set using relevant evaluation metrics. This will help us determine how well the model can generalize to new, unseen data and avoid overfitting to the training data.

In [18]:
split = int(len(df) * SPLIT) # SPLIT = 0.8
df = shuffle(df, random_state=RANDOM_STATE)
train, test = df.iloc[:split], df.iloc[split:]

Exploratory Data Analysis (EDA)¶

Target (type) Distribution¶

In [19]:
unique_labels = sorted(set(train['type'].unique()).union(set(test['type'].unique())))
palette_dict = {label: color_mapping[label] for label in unique_labels}

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 14)) 

train_labels = train['type'].value_counts()
ax[0, 0].pie(train_labels, labels=train_labels.index, autopct='%1.1f%%', startangle=90, 
             textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in train_labels.index])
ax[0, 0].set_title(f'Train Distribution', fontsize=fontsize, y=-0.1)

test_labels = test['type'].value_counts()
ax[0, 1].pie(test_labels, labels=test_labels.index, autopct='%1.1f%%', startangle=90, 
             textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in test_labels.index])
ax[0, 1].set_title(f'Test Distribution', fontsize=fontsize, y=-0.1)

sns.countplot(x='type', data=train, ax=ax[1, 0], order=unique_labels, 
              palette=[palette_dict[label] for label in unique_labels])
ax[1, 0].set_title('Train Countplot', fontsize=fontsize)

sns.countplot(x='type', data=test, ax=ax[1, 1], order=unique_labels, 
              palette=[palette_dict[label] for label in unique_labels])
ax[1, 1].set_title('Test Countplot', fontsize=fontsize)

fig.suptitle(f'\n URL Types Distribution', fontsize=18, y=1.02)

plt.tight_layout()
plt.subplots_adjust(top=0.92)  
plt.show()
No description has been provided for this image

Generate the WordCloud on Training Data¶

In [20]:
benign_url      = ' '.join(train[train['type'] == 'benign'].index)
phishing_url    = ' '.join(train[train['type'] == 'phishing'].index)
defacement_url  = ' '.join(train[train['type'] == 'defacement'].index)
malware_url     = ' '.join(train[train['type'] == 'malware'].index)

wordcloud_benign     = WordCloud(width=800, height=400, background_color='white').generate(benign_url)
wordcloud_phishing   = WordCloud(width=800, height=400, background_color='white').generate(phishing_url)
wordcloud_defacement = WordCloud(width=800, height=400, background_color='white').generate(defacement_url)
wordcloud_malware    = WordCloud(width=800, height=400, background_color='white').generate(malware_url)

fig, axs = plt.subplots(2, 2, figsize=(10, 6))

axs[0, 0].imshow(wordcloud_benign, interpolation='bilinear')
axs[0, 0].set_title('Benign URLs')
axs[0, 0].axis('off')

axs[0, 1].imshow(wordcloud_phishing, interpolation='bilinear')
axs[0, 1].set_title('Phishing URLs')
axs[0, 1].axis('off')

axs[1, 0].imshow(wordcloud_defacement, interpolation='bilinear')
axs[1, 0].set_title('php URLs')
axs[1, 0].axis('off')

axs[1, 1].imshow(wordcloud_malware, interpolation='bilinear')
axs[1, 1].set_title('Malware URLs')
axs[1, 1].axis('off')

plt.tight_layout()
plt.show()
No description has been provided for this image

Create Features Based on the WordCloud Insights¶

In [21]:
train = train.reset_index()
test = test.reset_index()

# For Malware URLs
train['exe_in_url']    = train['url'].apply(lambda x: 1 if 'exe' in x.lower() else 0)
train['mozi_in_url']   = train['url'].apply(lambda x: 1 if 'mozi' in x.lower() else 0)
train['jp_in_url']     = train['url'].apply(lambda x: 1 if 'jp' in x.lower() else 0)
train['mitsui_in_url'] = train['url'].apply(lambda x: 1 if 'mitsui' in x.lower() else 0)
train['mixh_in_url']   = train['url'].apply(lambda x: 1 if 'mixh' in x.lower() else 0)

test['exe_in_url']     = test['url'].apply(lambda x: 1 if 'exe' in x.lower() else 0)
test['mozi_in_url']    = test['url'].apply(lambda x: 1 if 'mozi' in x.lower() else 0)
test['jp_in_url']      = test['url'].apply(lambda x: 1 if 'jp' in x.lower() else 0)
test['mitsui_in_url']  = test['url'].apply(lambda x: 1 if 'mitsui' in x.lower() else 0)
test['mixh_in_url']    = test['url'].apply(lambda x: 1 if 'mixh' in x.lower() else 0)

# For Phishing URLs
train['ietf_in_url']  = train['url'].apply(lambda x: 1 if 'ietf' in x.lower() else 0)
train['tools_in_url'] = train['url'].apply(lambda x: 1 if 'tools' in x.lower() else 0)

test['ietf_in_url']   = test['url'].apply(lambda x: 1 if 'ietf' in x.lower() else 0)
test['tools_in_url']  = test['url'].apply(lambda x: 1 if 'tools' in x.lower() else 0)

# For Defacement URLs
train['index_in_url']       = train['url'].apply(lambda x: 1 if 'index' in x.lower() else 0)
train['com_content_in_url'] = train['url'].apply(lambda x: 1 if 'com_content' in x.lower() else 0)
train['option_in_url']      = train['url'].apply(lambda x: 1 if 'option' in x.lower() else 0)
train['php_in_url']         = train['url'].apply(lambda x: 1 if 'php' in x.lower() else 0)

test['index_in_url']        = test['url'].apply(lambda x: 1 if 'index' in x.lower() else 0)
test['com_content_in_url']  = test['url'].apply(lambda x: 1 if 'com_content' in x.lower() else 0)
test['option_in_url']       = test['url'].apply(lambda x: 1 if 'option' in x.lower() else 0)
test['php_in_url']          = test['url'].apply(lambda x: 1 if 'php' in x.lower() else 0)

train.set_index('url', inplace=True)
test.set_index('url', inplace=True)
In [22]:
print( train.shape, test.shape )
(486808, 49) (121702, 49)

Word Cloud Features Analysis¶

In [23]:
unique_labels = sorted(set(train['type'].unique()).union(set(test['type'].unique())))
palette_dict = {label: color_mapping[label] for label in unique_labels}
word_cloud_columns = ['exe_in_url', 'mozi_in_url', 'jp_in_url', 'mitsui_in_url', 'mixh_in_url' ,
                      'ietf_in_url', 'tools_in_url',
                      'index_in_url', 'com_content_in_url', 'option_in_url', 'php_in_url']

for col in word_cloud_columns:
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
    fig.subplots_adjust(hspace=0.8, wspace=0.2)

    counts = pd.crosstab(train[col], train['type']) 
    counts.plot(kind='bar', stacked=True, ax=ax[0, 0], color=[palette_dict[label] for label in counts.columns])
    ax[0, 0].set_title(f'Bar Plot of URL Types Based on {col}', fontsize=fontsize)
    ax[0, 0].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
    ax[0, 0].set_ylabel('Count', fontsize=fontsize)
    ax[0, 0].tick_params(axis='x', rotation=45, labelsize=labelsize)
    ax[0, 0].tick_params(axis='y', labelsize=labelsize)
    ax[0, 0].legend(unique_labels, loc='upper right')

    type_counts = pd.crosstab(train[col], train['type'], normalize='index')
    type_counts.plot(kind='bar', ax=ax[0, 1], color=[palette_dict[label] for label in type_counts.columns])
    ax[0, 1].set_title(f'Proportional Distribution of URL Types Based on {col}', fontsize=fontsize)
    ax[0, 1].set_ylabel('Proportion (%)', fontsize=fontsize)
    ax[0, 1].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
    ax[0, 1].tick_params(axis='x', rotation=45, labelsize=labelsize)
    ax[0, 1].tick_params(axis='y', labelsize=labelsize)
    ax[0, 1].legend(unique_labels, loc='upper right')

    val_0 = train[train[col] == 0]['type'].value_counts()
    val_1 = train[train[col] == 1]['type'].value_counts()

    ax[1, 0].pie(val_0, labels=val_0.index, autopct='%1.1f%%', startangle=90, 
                 textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_0.index])
    ax[1, 0].set_title(f'URL Type Distribution ({col} = 0)', fontsize=fontsize, y=-0.1)

    ax[1, 1].pie(val_1, labels=val_1.index, autopct='%1.1f%%', startangle=90, 
                 textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_1.index])
    ax[1, 1].set_title(f'URL Type Distribution ({col} = 1)', fontsize=fontsize, y=-0.1)

    fig.suptitle(f'\nAnalysis of URL Types Distribution of: {col}', fontsize=18, y=1.02)

    plt.tight_layout()
    plt.subplots_adjust(top=0.85)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Word Cloud Feature Analysis Results:¶


  1. Malware Detection:
    • The following features were critical in increasing the detection of Malware URLs:
      • 'exe_in_url': Presence of 'exe' in the URL was a strong indicator of malware.
      • 'mozi_in_url': URLs containing 'mozi' showed a significant correlation with malware.
      • 'jp_in_url': 'jp' in the URL flagged a substantial number of malicious URLs.
      • 'mitsui_in_url': The presence of 'mitsui' led to higher malware detection rates.
      • 'mixh_in_url': URLs with 'mixh' contributed to the enhanced identification of malware.

  1. Phishing Detection:
    • The following features were effective in improving the detection of Phishing URLs:
      • 'ietf_in_url': The presence of 'ietf' significantly raised the detection of phishing attempts.
      • 'tools_in_url': URLs with 'tools' had a strong association with phishing activity.

  1. Defacement Detection:
    • The following features were highly influential in detecting URLs associated with Defacement:
      • 'index_in_url': 'index' in the URL indicated PHP-related URLs more frequently.
      • 'com_content_in_url': This feature significantly increased the detection of URLs with 'com_content', related to PHP-based sites.
      • 'option_in_url': The presence of 'option' in the URL improved identification of PHP URLs.
      • 'php_in_url': Naturally, the feature 'php_in_url' was very effective in detecting PHP-based URLs.

Binary Features Analysis¶

In [24]:
unique_labels = sorted(set(train['type'].unique()).union(set(test['type'].unique())))
palette_dict = {label: color_mapping[label] for label in unique_labels}

for col in categorical_columns:
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
    fig.subplots_adjust(hspace=0.8, wspace=0.2)

    counts = pd.crosstab(train[col], train['type']) 
    counts.plot(kind='bar', stacked=True, ax=ax[0, 0], color=[palette_dict[label] for label in counts.columns])
    ax[0, 0].set_title(f'Bar Plot of URL Types Based on {col}', fontsize=fontsize)
    ax[0, 0].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
    ax[0, 0].set_ylabel('Count', fontsize=fontsize)
    ax[0, 0].tick_params(axis='x', rotation=45, labelsize=labelsize)
    ax[0, 0].tick_params(axis='y', labelsize=labelsize)
    ax[0, 0].legend(unique_labels, loc='upper right')

    type_counts = pd.crosstab(train[col], train['type'], normalize='index')
    type_counts.plot(kind='bar', ax=ax[0, 1], color=[palette_dict[label] for label in type_counts.columns])
    ax[0, 1].set_title(f'Proportional Distribution of URL Types Based on {col}', fontsize=fontsize)
    ax[0, 1].set_ylabel('Proportion (%)', fontsize=fontsize)
    ax[0, 1].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
    ax[0, 1].tick_params(axis='x', rotation=45, labelsize=labelsize)
    ax[0, 1].tick_params(axis='y', labelsize=labelsize)
    ax[0, 1].legend(unique_labels, loc='upper right')

    val_0 = train[train[col] == 0]['type'].value_counts()
    val_1 = train[train[col] == 1]['type'].value_counts()

    ax[1, 0].pie(val_0, labels=val_0.index, autopct='%1.1f%%', startangle=90, 
                 textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_0.index])
    ax[1, 0].set_title(f'URL Type Distribution ({col} = 0)', fontsize=fontsize, y=-0.1)

    ax[1, 1].pie(val_1, labels=val_1.index, autopct='%1.1f%%', startangle=90, 
                 textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_1.index])
    ax[1, 1].set_title(f'URL Type Distribution ({col} = 1)', fontsize=fontsize, y=-0.1)

    fig.suptitle(f'\nAnalysis of URL Types Distribution of: {col}', fontsize=18, y=1.02)

    plt.tight_layout()
    plt.subplots_adjust(top=0.85)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Additional Binary Feature Analysis Results:¶


  1. Malware Detection:
    • 'having_ip_address = 1': URLs that contain an IP address strongly increase the likelihood of being classified as Malware URLs. However, there are relatively few such observations.
    • 'has_port_number = 1': URLs with a port number are strongly associated with Malware URLs, but the number of such occurrences is low.
    • 'is_suspicious_suffix = 1': URLs with suspicious suffixes (like uncommon or unusual domain extensions) show a significant increase in the likelihood of being detected as Malware URLs.

  1. Defacement Detection:
    • 'is_abnormal_url = 1': Abnormal URLs are highly correlated with Defacement URLs and slightly increase the likelihood of detecting Malware URLs.
    • 'contains_non_ascii = 1': URLs containing non-ASCII characters are strongly linked to Defacement URLs, although these cases are rare.
    • 'is_abnormal_url = 1': URLs classified as abnormal significantly increase the likelihood of being detected as both Phishing and Defacement URLs.

  1. Phishing Detection:
    • 'is_risky_tld = 1': URLs with risky top-level domains (TLDs) increase the likelihood of detecting both Phishing URLs and Defacement URLs, but there are very few observations.
    • 'contains_suspicious_word = 1': The presence of suspicious words in the URL is strongly correlated with Phishing URLs.

  1. No Significant Change:
    • 'has_shortening_service = 1': URLs using shortening services did not show a significant change in the distribution or detection rates for any specific type of malicious URL.

malicious URL. cific URL type.

Numerical Features Analysis¶

In [25]:
n_rows = 14
n_cols = 2

fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 36))
axes = axes.flatten()  

for i, col in enumerate(numerical_columns):
    ax = axes[i]
    sns.violinplot(data=train, x='type', y=col, palette='Set2', ax=ax)

    ax.set_title(f'Analysis of URL Types Distribution of: {col}', fontsize=10)
    ax.set_xlabel('Type', fontsize=8)
    ax.set_ylabel(col, fontsize=8)
    ax.grid(True)

plt.tight_layout()
plt.show()
No description has been provided for this image

Analysis of 'count_@':¶

  • Overall Summary: The feature 'count_@' shows negligible variation across different URL types. Almost all the observations for this feature have a value of zero, regardless of whether the URL is benign, phishing, defacement, or malware.
  • Conclusion: Since this feature is almost always zero across all types of URLs, it provides little to no useful information for distinguishing between different types of URLs. Therefore, 'count_@' can be dropped from further analysis without impacting the model's performance.

Analysis of 'number_of_embedded':¶

  • Overall Summary: The feature 'number_of_embedded' also exhibits very little variation. Nearly all URLs, across all types (benign, phishing, defacement, malware), have zero embedded elements. There is hardly any differentiation between URL types based on this feature.
  • Conclusion: Given the lack of variation and low discriminatory power of the 'number_of_embedded' feature, it can be safely removed from the dataset. Retaining this feature is unlikely to contribute any meaningful insights or improve model accuracy.

Analysis of 'digit_char_ratio', 'alpha_char_ratio', 'url_length', 'count_alpha', 'count_digits', 'special_char_ratio', and 'sum_special_chars':¶

  • Overall Summary: Each of these features — 'digit_char_ratio', 'alpha_char_ratio', 'url_length', 'count_alpha', 'count_digits', 'special_char_ratio', and 'sum_special_chars' — displays useful patterns across different URL types. These features provide critical insights into the structure and composition of URLs, which helps in distinguishing between benign, phishing, defacement, and malware URLs.

  • Feature Interdependence:

    • 'url_length', 'count_alpha', and 'count_digits' are fundamental metrics that capture the length of the URL and the counts of alphabetic and numeric characters. Malicious URLs may use unusual character distributions, which makes these features particularly informative.
    • 'digit_char_ratio' and 'alpha_char_ratio' are derived from the raw counts, capturing the proportion of numeric and alphabetic characters in the URL. These ratios are complementary to the absolute counts, providing a more nuanced view of how certain character types dominate different types of URLs.
    • 'special_char_ratio' and 'sum_special_chars' specifically focus on special characters (like punctuation or symbols). Given that the presence of special characters often indicates deliberate URL obfuscation, especially in phishing or malware URLs, both the raw count and the ratio relative to the URL length provide valuable signals.
  • Conclusion: Both the raw counts ('url_length', 'count_alpha', 'count_digits', 'sum_special_chars') and their corresponding ratios ('digit_char_ratio', 'alpha_char_ratio', 'special_char_ratio') are highly informative features. They provide complementary views of URL composition and contribute significantly to identifying potential malicious URLs. Therefore, all of these features should be retained in the analysis as they offer substantial predictiv


Analysis of repeated_char_ratio:¶

Upon analyzing the distribution of 'count_repeated_char', it appears to be a useful feature for distinguishing between URL types. This makes sense, as repeated characters can often be indicative of obfuscation in malicious URLs

To enhance the feature further, we will create a new feature, 'repeated_char_ratio', by dividing 'count_repeated_char' by 'url_length'. This will normalize the count of repeated characters based on the length of the URL, allowing us to capture meaningful patterns across URLs of different lengt


Conclusion for Remaining Features:¶

For the remaining features such as 'longest_digit_sequence', 'count_https', 'count_http', 'count_dot', 'count_%', 'count_?', 'count_-', 'count_=', 'count_/', 'count_//', 'count_parameters', 'count_subdomain', 'number_of_directories', 'domain_length', 'path_length', 'tld_length', and 'first_directory_length', we observe that some show slight differences in distributions across URL types.

While these features don't exhibit as pronounced separation as others, they are still logically relevant to the problem of identifying different URL types. For instance:

  • The counts of specific characters ('count_https', 'count_dot', 'count_/', 'count_parameters') are important since they can hint at URL structure differences.
  • Features like 'longest_digit_sequence', 'number_of_directories', 'domain_length', and 'path_length' may help distinguish more complex URLs, particularly in malicious cases like phishing and malware.

Although the separation is subtle, these features likely contribute useful information when combined with others, especially in more advanced models. Therefore, we will retain these features for further analysis and potential model input. s. hs. e power.

In [26]:
#  Drop Features ('count_@', 'number_of_embedded') with very little variance and therefore, provide minimal information.
train.drop(['count_@', 'number_of_embedded'], axis=1, inplace=True)
test.drop(['count_@', 'number_of_embedded'], axis=1, inplace=True)

numerical_columns.remove('count_@')
numerical_columns.remove('number_of_embedded')
In [27]:
# Create new features (repeated_char_ratio):
train['repeated_char_ratio'] = train['count_repeated_char'] / train['url_length']
test['repeated_char_ratio']  = test['count_repeated_char']  / test['url_length']

numerical_columns.append('repeated_char_ratio')

Feature Selection + Statistical Test¶


What is Feature Selection?¶

Feature selection is the process of identifying and selecting the most relevant features (or variables) for use in machine learning models. By choosing the most important features, we aim to reduce the dimensionality of the data, which can lead to better model performance, lower computational cost, and improved interpretability.

Purpose of Feature Selection¶

  1. Improves Model Performance: Irrelevant or redundant features can introduce noise and decrease the performance of a model. Selecting the right set of features helps in building models that generalize better.
  2. Reduces Overfitting: By removing irrelevant features, the model is less likely to memorize the training data, which reduces the risk of overfitting and improves performance on unseen data.
  3. Increases Model Interpretability: Fewer features make models simpler to understand and interpret, especially in real-world applications.
  4. Decreases Computational Cost: Working with fewer features reduces training time and makes the model more efficient, which is especially important when working with large datasets.

Workflow¶

Step 1: Statistical Test Using Linear Methods¶

We start by applying linear feature selection methods to filter out irrelevant features. The two methods used are:

  • ANOVA F-Test (f_classif): This test is applied to numerical features to check if they have a linear relationship with the target variable.
  • Chi-Square Test: This is used for binary/categorical features to assess the relevance of each feature to the target.

These tests provide an initial filter to remove features that have weak linear associations with the target.

Step 2: Feature Selection Using BorutaPy¶

After the initial filtering, we apply BorutaPy, which is an all-relevant feature selection method based on a Random Forest model. BorutaPy iteratively tests whether each feature is truly important by comparing its importance with randomized (shadow) features.

Boruta helps identify both strong and weak relevant features, including those with complex interactions, which might not be captured by linear tests like ANOVA or Chi-Square.

Step 3: Further Feature Selection Using Mutual Information¶

Once BorutaPy has selected the best features, we refine the selection using Mutual Information (MI). MI measures the dependency between each feature and the target, capturing both linear and non-linear relationships.

We select the top features based on their mutual information scores, typically choosing the same number of features as selected by BorutaPy (len(boruta_selected_features)).

Step 4: Model Building and Comparison¶

In this step, we train models using different sets of features and compare their performance. Specifically, we build and evaluate:

  1. Model 1: Using the full dataset with all features.
  2. Model 2: Using features selected by BorutaPy.
  3. Model 3: Using features selected by Mutual Information (refined from Boruta-selected features).

We then compare the performance of these models using relevant evaluation metrics (accuracy, precision, recall, F-1, AUC) to assess which feature selection method or dataset produces the best results.

In [29]:
univariate = f_classif(train[numerical_columns], train['type'])

univariate = pd.Series(univariate[1])
univariate.index = numerical_columns
univariate.sort_values(ascending=False).plot.bar(figsize=(10, 6))
plt.show()
No description has been provided for this image

f_classif¶

What is f_classif?¶

f_classif is a statistical test that helps evaluate the relationship between each feature and the target variable in classification tasks. It is based on the ANOVA (Analysis of Variance) test, which measures how much variance in the target variable is explained by each feature.

How Does f_classif Work?¶

  1. Input:

    • Takes features (X) and a categorical target (y).
  2. ANOVA F-value:

    • Computes an F-statistic for each feature, which shows the ratio of variance between the groups (classes) to the variance within each group.
  3. Feature Importance:

    • A higher F-value indicates that the feature provides more information to predict the target.
  4. Output:

    • Returns an array of F-values and their corresponding p-values, where smaller p-values suggest that the feature is statistically significant.
In [30]:
univariate = f_classif(train[numerical_columns], train['type'])
univariate = pd.Series(univariate[1])
univariate.index = numerical_columns

ax = univariate.sort_values(ascending=True).plot.bar(figsize=(10, 6))
ax.set_title('Univariate Feature Importance Using ANOVA F-test', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('p-values', fontsize=14)
plt.show()
No description has been provided for this image

ANOVA F-test (f_classif) Results:¶

The F-test results reveal that all features have p-values equal to 0. This suggests that there is a significant difference in the mean values of the features across the target variable categories. Therefore, these features are highly relevant and important for predictive modeling.

¶

Chi-Square Test¶

The Chi-Square Test evaluates the association between categorical variables.

Key Points:¶

  • Types:
    • Independence Test: Checks relationships between two variables.
    • Goodness of Fit Test: Compares observed distributions to expected ones.

Process:¶

  1. Set null and alternative hypotheses.
  2. Create a contingency table.
  3. Calculate expected frequencies.
  4. Compute the Chi-Square statistic and p-value.
  5. Reject the null hypothesis if p < significance level (0.05).

pretability.

In [31]:
chi_ls = []
for feature in categorical_columns + word_cloud_columns:
    c = pd.crosstab(train['type'], train[feature])    
    p_value = stats.chi2_contingency(c)[1]
    chi_ls.append(p_value)

chi_test = pd.Series(chi_ls, index=categorical_columns + word_cloud_columns).sort_values(ascending=True)
chi_test.plot.bar(rot=45)
plt.ylabel('p value')
plt.title('Feature importance based on chi-square test')
plt.show()
No description has been provided for this image

Chi-Square Test Results:¶

The Chi-square test results show that all features have extremely low p-values, 0 or close to 0 (1.59e-147). This indicates a strong association between each feature and the target variable. These features are likely valuable predictors and should be considered for further analysis in the model.


¶


Overall Summary:¶

From the feature selection tests conducted, both categorical and numerical features show strong influence on the target variable.

  • The Chi-square test revealed extremely low p-values for all categorical features, indicating a significant association with the target variable. This suggests that these features are highly informative and relevant for modeling.

  • The ANOVA F-test (f_classif) also showed p-values of 0 for all numerical features, meaning there is a significant difference in the means of these features across the different target categories. These numerical features are equally important for the model.

In conclusion, both categorical and numerical features demonstrate strong statistical significance and influence on the target, making them valuable predictors for further modeling.

Encoding Target (type) AND Split to X & y¶

In [32]:
encoding_map = {"benign": 0, "phishing": 1, "defacement": 2, "malware": 3}

train['type'] = train['type'].apply(lambda x: encoding_map[x])
test['type'] = test['type'].apply(lambda x: encoding_map[x])

X_train, y_train = train.drop('type', axis=1), train['type']
X_test, y_test   = test.drop('type', axis=1), test['type']

BorutaPy¶

What is BorutaPy?¶

BorutaPy is a feature selection method specifically designed for use with tree-based models, such as Random Forest. It is an implementation of the Boruta algorithm, which is a wrapper method that iteratively selects the most important features while considering their importance in relation to shuffled, random features (called "shadow features").

The Boruta algorithm is designed to identify all relevant features, rather than just the minimal set required for good model performance. This makes it particularly useful in complex datasets where interactions between features are important.

How Does BorutaPy Work?¶

  1. Create Shadow Features:

    • Boruta duplicates the original features and shuffles them to create shadow features. These shadow features serve as a baseline for feature importance comparison.
  2. Train Random Forest:

    • A Random Forest model is trained on the dataset including both the original and shadow features.
  3. Feature Importance Comparison:

    • The algorithm computes feature importance scores for both original and shadow features.
    • Features that have higher importance than the most important shadow feature are considered "important."
    • Features with lower importance are considered "unimportant."
  4. Iterative Process:

    • The process is repeated for a set number of iterations or until all features are determined to be either important or unimportant.
    • Features that remain undetermined after several iterations are categorized as "tentative."
  5. Final Selection:

    • At the end of the iterations, Boruta selects features that consistently show higher importance than shadow features, ensuring that no potentially relevant feature is discarded too early.
In [33]:
%%time
rf = RandomForestClassifier(n_estimators=20, random_state=RANDOM_STATE, n_jobs=4)
boruta_selector = BorutaPy(rf, n_estimators=4, random_state=RANDOM_STATE)
boruta_selector.fit(X_train.values, y_train)

boruta_selected_features_mask = boruta_selector.support_
boruta_selected_features = X_train.columns[boruta_selected_features_mask]

selected_feature_indices = np.where(boruta_selected_features_mask)[0]
selected_feature_importances = boruta_selector.estimator.feature_importances_[selected_feature_indices]

feature_importance_df = pd.DataFrame({
    'Feature': boruta_selected_features,
    'Importance': selected_feature_importances
})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df, palette='viridis')
plt.title('Feature Importances from BorutaPy')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

print()
plot_corr_matrix(X_train[boruta_selected_features])
No description has been provided for this image
CPU times: total: 19min 12s
Wall time: 6min 12s
No description has been provided for this image

Mutual Information for Classification - mutual_info_classif¶

What is Mutual Information?¶

Mutual information measures the amount of information gained about one variable by knowing another. In this context, it quantifies how much knowing a particular feature reduces uncertainty about the target class.

How Does mutual_info_classif Work?¶

  1. Estimates Dependency:

    • Mutual information measures the dependency between a feature and the target variable. A high score means that the feature provides a lot of information about the target (it helps reduce uncertainty about the target).
  2. Captures Non-linear Relationships:

    • Unlike methods like correlation that only capture linear relationships, mutual information can detect both linear and non-linear relationships between features and the target.
  3. Ranges Between 0 and 1:

    • Mutual information is always non-negative. A score of 0 means the feature and the target are independent (the feature provides no useful information about the target).
  4. Discretization for Continuous Data:

    • For continuous features, mutual_info_classif uses k-nearest neighbors to discretize the data, allowing it to work effectively with both categorical and continuous data.
In [34]:
%%time
selector = SelectKBest(score_func=mutual_info_classif, k=len(boruta_selected_features)) 

selector.fit(X_train, y_train)
selected_features_mask = selector.get_support()

mi_selected_features = X_train.columns[selected_features_mask]
feature_scores    = selector.scores_

selected_features_scores = pd.DataFrame({
    'Feature': mi_selected_features,
    'Score': feature_scores[selected_features_mask]
})

selected_features_scores = selected_features_scores.sort_values(by='Score', ascending=True)

plt.figure(figsize=(10, 6))
sns.barplot(x='Score', y='Feature', data=selected_features_scores, palette='viridis')
plt.title('Top Selected Features Based on Mutual Information Scores')
plt.xlabel('Mutual Information Score')
plt.ylabel('Selected Features')
plt.gca().invert_yaxis() 
plt.tight_layout()
plt.show()

print()
plot_corr_matrix(X_train[mi_selected_features])
No description has been provided for this image
CPU times: total: 3min 10s
Wall time: 3min 12s
No description has been provided for this image
In [35]:
boruta_feature_set = set(boruta_selected_features)
mi_feature_set = set(mi_selected_features)
unique_to_boruta = boruta_feature_set - mi_feature_set
unique_to_mi = mi_feature_set - boruta_feature_set

common_features = boruta_feature_set.intersection(mi_feature_set)

print("Common Features:", common_features)
print("\nUnique to Boruta:", unique_to_boruta)
print("\nUnique to MI:", unique_to_mi)
Common Features: {'index_in_url', 'number_of_directories', 'special_char_ratio', 'path_length', 'sum_special_chars', 'count_digits', 'digit_char_ratio', 'count_http', 'tld_length', 'count_//', 'repeated_char_ratio', 'count_subdomain', 'count_dot', 'count_alpha', 'count_repeated_char', 'alpha_char_ratio', 'is_abnormal_url', 'url_length', 'has_subdomain', 'longest_digit_sequence', 'count_/', 'first_directory_length'}

Unique to Boruta: {'exe_in_url', 'count_-', 'count_https', 'domain_length'}

Unique to MI: {'count_parameters', 'count_=', 'php_in_url', 'option_in_url'}

¶

¶


Building Models¶


We will use two kinds of models:¶

  1. Traditional Models (Base-tree models) + Optuna:
    • XGBoost
    • LightGBM
    • CatBoost

Why use them?

  • These models deliver high accuracy and efficiency on a wide range of predictive tasks.
  • They are known for their scalability and are quite fast, allowing rapid training even on large datasets.
  • Their ability to manage imbalanced data and outliers makes them highly flexible for real-world applications.
  • They do not require feature scaling, making preprocessing simpler compared to models like logistic regression or SVM.
  • They can handle correlated features effectively due to the way tree-based models work, reducing the impact of multicollinearity on performance.

  1. Deep Learning Models
    • FNN
    • FNN + BERT

¶

Hyperparameter Optimization¶

What is Hyperparameter Optimization?¶

Hyperparameter optimization (HPO) is the process of selecting the best combination of hyperparameters for a machine learning model. Hyperparameters are the configuration settings that are not learned from the data during the training process, but instead set before training begins. Examples of hyperparameters include:

  • Learning rate
  • Regularization parameters
  • Tree depth in tree-based models

HPO aims to improve model performance by tuning these hyperparameters to achieve the best possible predictive accuracy.

Why is Hyperparameter Optimization Important?¶

  1. Model Performance: The right set of hyperparameters can significantly enhance the model's accuracy, robustness, and generalization to unseen data.
  2. Avoiding Overfitting: Proper tuning helps to prevent overfitting, where a model performs well on training data but poorly on new data.
  3. Resource Efficiency: HPO helps to utilize computational resources effectively by identifying the best configurations quickly, avoiding unnecessary computations.
  4. Automating the Process: Automated HPO methods can save time and reduce human bias in the selection of hyperparameters.

What is Optuna?¶

Optuna is an open-source hyperparameter optimization framework that automates the hyperparameter tuning process. It is designed for flexibility and ease of use, allowing users to define optimization objectives and manage trials efficiently. Optuna employs advanced optimization algorithms to explore the hyperparameter space intelligently.

Why Use Optuna?¶

  1. Efficiency: Optuna uses sophisticated algorithms (like Tree-structured Parzen Estimator) to find optimal hyperparameters quickly, reducing the number of trials needed.
  2. Automatic Pruning: The pruning feature allows for real-time stopping of trials that are unlikely to perform well, further enhancing efficiency.
  3. Ease of Use: With its simple API, users can define complex optimization problems with minimal effort.
  4. Visualization: Optuna provides visualization tools to help analyze the optimization process, making it easier to understand hyperparameter effects.
  5. Community and Support: As an open-source tool, Optuna has strong community support and regular updates, ensuring it stays relevant and powerful.

In summary, hyperparameter optimization is crucial for improving machine learning model performance, and Optuna is a robust tool that facilitates efficient and effective HPO.


We will use Optuna to perform hyperparameter tuning for each model, optimizing for the best performance. This tuning will be complemented by k-fold cross-validation using cross_val_score from scikit-learn, which allows us to evaluate model performance robustly across different subsets of the data.

Traditional Models¶

In [533]:
def objective(trial: optuna.Trial, X_train, y_train, model):

    classifier = classifier_name = type(model).__name__
    sample_weight = compute_sample_weight('balanced', y_train)

    if classifier == "XGBClassifier":
                
        dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weight)
        eval_metric = trial.suggest_categorical("eval_metric", ['auc', 'mlogloss'])    
        num_boost_round = trial.suggest_int("n_estimators", 60, 100)
        params = {
            'max_depth': trial.suggest_int("max_depth", 5, 9),
            'learning_rate': trial.suggest_float("learning_rate", 0.1, 0.3),
            'subsample': trial.suggest_float("subsample", 0.7, 0.9),
            'colsample_bynode': trial.suggest_float("colsample_bynode", 0.7, 0.9),
            'colsample_bylevel': trial.suggest_float("colsample_bylevel", 0.7, 0.9),
            'colsample_bytree': trial.suggest_float("colsample_bytree", 0.7, 0.9),
            'reg_lambda': trial.suggest_float("reg_lambda", 1e-1, 1.0),
            'reg_alpha': trial.suggest_float("reg_alpha", 1e-1, 1.0),
            'min_child_weight': trial.suggest_float("min_child_weight", 0.5, 3.0),
            'max_delta_step': trial.suggest_float("max_delta_step", 0.5, 3.0),
            'booster': trial.suggest_categorical("booster", ['dart']),
            'objective': trial.suggest_categorical("objective", ['multi:softmax']),
            'num_class': trial.suggest_categorical("num_class", [4]),
            'random_state': trial.suggest_categorical("random_state", [ RANDOM_STATE ]),
            'nthread': trial.suggest_categorical("nthread", [ 4 ]),
            'n_jobs': trial.suggest_categorical("n_jobs", [ 1 ]),
            'eval_metric': eval_metric,
        }    
        cv_results = xgb.cv(
            params,
            dtrain,
            num_boost_round= num_boost_round, 
            metrics=[eval_metric],  
            nfold=3,
            as_pandas=True,  
            shuffle=False
        )
        return cv_results['test-' + eval_metric + '-mean'].mean()       
    

    elif classifier == "LGBMClassifier":
        dtrain = lgb.Dataset(X_train, label=y_train, weight=sample_weight) 
        params = {
            'objective': trial.suggest_categorical("objective", [ 'multiclass' ]),
            'metric': trial.suggest_categorical("eval_metric", ['multi_logloss']),
            'num_class': trial.suggest_categorical("num_class", [ 4 ]),
            'max_depth': trial.suggest_int("max_depth", 5, 9),
            'num_leaves': trial.suggest_int("num_leaves", 31, 96),
            'learning_rate': trial.suggest_float("learning_rate", 0.1, 0.3),
            'boosting_type': trial.suggest_categorical("boosting_type", ['dart']),
            'min_child_samples': trial.suggest_int("min_child_samples", 40, 100),
            'min_child_weight': trial.suggest_loguniform('min_child_weight', 0.01, 1.0),
            'subsample': trial.suggest_float("subsample", 0.7, 0.9),
            'colsample_bytree': trial.suggest_float("colsample_bytree", 0.7, 0.9),                    
            'reg_lambda': trial.suggest_float("reg_lambda", 1e-1, 1.0),
            'reg_alpha': trial.suggest_float("reg_alpha", 1e-1, 1.0),
            'seed': trial.suggest_categorical("seed", [ RANDOM_STATE ]),
            'n_jobs': trial.suggest_categorical("n_jobs", [ 1 ]),
            'force_row_wise': True,
            'verbosity': -1
        }
        num_boost_round = trial.suggest_int('num_boost_round', 60, 100)
        
        cv_results = lgb.cv(
            params,
            dtrain,
            nfold=3,
            num_boost_round=num_boost_round,
            stratified=True,
        )
        return cv_results['valid multi_logloss-mean'][-1]


    elif classifier == "CatBoostClassifier":
        params = {
            'iterations': trial.suggest_int("iterations", 60, 100),
            'depth': trial.suggest_int("depth", 5, 9),
            'learning_rate': trial.suggest_float("learning_rate", 0.1, 0.3, log=True),
            'l2_leaf_reg': trial.suggest_float("l2_leaf_reg", 1e-1, 10.0),
            'bootstrap_type': trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli"]),
            'random_strength': trial.suggest_float("random_strength", 0.1, 10.0),
            'border_count': trial.suggest_int("border_count", 1, 127),
            'random_seed': trial.suggest_categorical("random_seed", [RANDOM_STATE]),
            'loss_function': 'MultiClass',
            'logging_level': 'Silent'
        }
    
        if params['bootstrap_type'] == 'Bayesian':
            params['bagging_temperature'] = trial.suggest_float("bagging_temperature", 0.0, 1.0)
    
        cv_results = cat.cv(
            cat.Pool(X_train, y_train, weight=sample_weight),
            params=params,
            nfold=3,
        )
    
        return cv_results['test-MultiClass-mean'].mean()

Handling Imbalanced Data in Machine Learning¶

What is Imbalanced Data?¶

Imbalanced data refers to a situation in machine learning where the distribution of classes is not uniform. In a multi-class classification problem, one or more classes can have significantly fewer samples compared to others. For instance, in the given case, the distribution is as follows:

  • Label 0: 67.4% (benign)
  • Label 1: 14.9% (phishing)
  • Label 2: 13.8% (defacement)
  • Label 3: 3.9% (malware)

In this scenario, label 3 is heavily underrepresented, which can lead to biased predictions and inadequate learning for that class.

Why is Imbalanced Data a Problem?¶

Imbalanced data can lead to several issues in model training and evaluation:

  • Biased Predictions: Models trained on imbalanced data tend to favor the majority class, leading to poor performance on the minority class.
  • Misleading Metrics: Accuracy can be a misleading metric; a model that predicts the majority class well may still perform poorly overall if it fails to identify the minority class.
  • Poor Generalization: The model may not learn the underlying patterns of the minority class, resulting in a lack of generalization and high error rates when making predictions on unseen data.

What is compute_sample_weight?¶

compute_sample_weight is a utility function from sklearn.utils that helps assign weights to samples in a dataset based on their class distribution. It generates a weight for each instance to give more importance to underrepresented classes.

Why Use compute_sample_weight with CatBoost, LightGBM, and XGBoost?¶
  • Tree-Based Models: Algorithms like CatBoost, LightGBM, and XGBoost can benefit from sample weights because they allow the model to adjust the learning process according to the importance of each sample.
  • Handling Imbalance: By applying sample weights, these models can better learn the minority class during training, thus improving classification performance across all classes.
  • Flexible Weighting: This approach allows you to assign different weights based on additional context or business logic, further enhancing model performance on critical instances.

In conclusion, addressing imbalanced data through methods like compute_sample_weight is crucial for improving the robustness and accuracy of machine learning models, particularly in tree-based frameworks.

In [547]:
%%time

models = {
    "XGB" : {"model": xgb.XGBClassifier},
    "LGBM": {"model": lgb.LGBMClassifier},
    "CAT" : {"model": cat.CatBoostClassifier},
}

feature_selection_techniques = {"all_features": X_train.columns, "Boruta_features": boruta_selected_features, "MI_features": mi_selected_features}
trials = 5

feature_set_result = {}
best_models        = {}

for name, model_dict in models.items():        
    for key, feature_list in feature_selection_techniques.items():
        print(f"Running {name} with - {key}")

        
        study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE))
        start_time = time.time()
        study.optimize(lambda trial: objective(trial, X_train[feature_list], y_train, model_dict["model"]()), n_trials=trials, n_jobs=1)
        end_time = time.time()
        optuna_time_elapsed = end_time - start_time
                
        best_params = study.best_trial.params        
        best_model = model_dict["model"](**best_params)
        sample_weight = compute_sample_weight('balanced', y_train)
        
        start_time = time.time()
        best_model.fit(X_train[feature_list], y_train, sample_weight=sample_weight)
        end_time = time.time()
        time_elapsed = end_time - start_time
        
        y_pred = best_model.predict(X_test[feature_list])
        
        accuracy  = accuracy_score( y_test, y_pred )
        precision = precision_score( y_test, y_pred, average='macro' )
        recall    = recall_score( y_test, y_pred, average='macro' )
        f1        = f1_score( y_test, y_pred, average='macro' )

        if name not in best_models:
            best_models[name] = {}  
        best_models[name][key] = best_model
        

        if key not in feature_set_result:
            feature_set_result[key] = {}
            
        feature_set_result[key][name] = {
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1,
            'TimeElapsed': time_elapsed,
            'Optuna_TimeElapsed': optuna_time_elapsed
        }
        
    print("\n----------------------------------------------------\n")
[I 2024-10-23 09:31:47,686] A new study created in memory with name: no-name-6848f628-0dbc-4950-b8df-cb664c6edf43
Running XGB with - all_features
[I 2024-10-23 09:53:53,590] Trial 0 finished with value: 0.3726073402698043 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.3726073402698043.
[I 2024-10-23 10:32:21,768] Trial 1 finished with value: 0.9893102220774443 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 10:57:57,064] Trial 2 finished with value: 0.9871548896245902 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 11:15:55,937] Trial 3 finished with value: 0.2532936644764878 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 11:37:19,235] Trial 4 finished with value: 0.2803199337494971 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 12:30:39,896] A new study created in memory with name: no-name-92373be5-dac7-400c-a0ae-556de54309da
Running XGB with - Boruta_features
[I 2024-10-23 12:43:59,433] Trial 0 finished with value: 0.38226456908590534 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.38226456908590534.
[I 2024-10-23 13:06:48,202] Trial 1 finished with value: 0.9881188472914543 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 13:25:28,732] Trial 2 finished with value: 0.9858997093684756 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 13:38:04,560] Trial 3 finished with value: 0.2650465790535372 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 13:52:30,938] Trial 4 finished with value: 0.2909791570453547 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 14:22:10,733] A new study created in memory with name: no-name-6fe0953d-44e6-41a1-9acf-8732974a3815
Running XGB with - MI_features
[I 2024-10-23 14:35:30,272] Trial 0 finished with value: 0.4094491047567859 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.4094491047567859.
[I 2024-10-23 15:00:18,849] Trial 1 finished with value: 0.98652911126521 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 15:19:20,253] Trial 2 finished with value: 0.9840133866564393 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 15:31:48,659] Trial 3 finished with value: 0.28565134511602036 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 15:46:01,117] Trial 4 finished with value: 0.3178367951841102 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 16:15:49,202] A new study created in memory with name: no-name-a0e2dfff-ad1e-430a-83da-8c2658eb7f24
----------------------------------------------------

Running LGBM with - all_features
[I 2024-10-23 16:20:18,471] Trial 0 finished with value: 0.21128037664698396 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 77, 'learning_rate': 0.1376303920077012, 'boosting_type': 'dart', 'min_child_samples': 42, 'min_child_weight': 0.025706201341357374, 'subsample': 0.7212125748952531, 'colsample_bytree': 0.845448028736891, 'reg_lambda': 0.7114604711726275, 'reg_alpha': 0.5264611330673967, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 78}. Best is trial 0 with value: 0.21128037664698396.
[I 2024-10-23 16:24:34,273] Trial 1 finished with value: 0.2229241640471742 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 5, 'num_leaves': 80, 'learning_rate': 0.2204897078016253, 'boosting_type': 'dart', 'min_child_samples': 98, 'min_child_weight': 0.21317550217208076, 'subsample': 0.8213259238637353, 'colsample_bytree': 0.7898302629863433, 'reg_lambda': 0.30281874687342597, 'reg_alpha': 0.7031568672034261, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 90}. Best is trial 1 with value: 0.2229241640471742.
[I 2024-10-23 16:28:04,368] Trial 2 finished with value: 0.18669233216569361 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 37, 'learning_rate': 0.2921819484473355, 'boosting_type': 'dart', 'min_child_samples': 55, 'min_child_weight': 0.036671632089617025, 'subsample': 0.853650786932557, 'colsample_bytree': 0.8595846794229967, 'reg_lambda': 0.5896334785603745, 'reg_alpha': 0.44443686758197776, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 75}. Best is trial 1 with value: 0.2229241640471742.
[I 2024-10-23 16:31:05,025] Trial 3 finished with value: 0.22643134709754595 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 79, 'learning_rate': 0.1477973664871458, 'boosting_type': 'dart', 'min_child_samples': 66, 'min_child_weight': 0.5848943222502431, 'subsample': 0.7578562280665435, 'colsample_bytree': 0.8569013714174842, 'reg_lambda': 0.7830582910962314, 'reg_alpha': 0.47600684644099023, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 69}. Best is trial 3 with value: 0.22643134709754595.
[I 2024-10-23 16:36:38,141] Trial 4 finished with value: 0.18538714391724262 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 35, 'learning_rate': 0.21928653738419931, 'boosting_type': 'dart', 'min_child_samples': 91, 'min_child_weight': 0.6094986845757975, 'subsample': 0.7401054887766454, 'colsample_bytree': 0.8004790468730579, 'reg_lambda': 0.9058436600151478, 'reg_alpha': 0.3303288382494436, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 95}. Best is trial 3 with value: 0.22643134709754595.
[I 2024-10-23 16:37:51,994] A new study created in memory with name: no-name-14a5bfb0-d65d-4048-a3a8-6997bde63d13
Running LGBM with - Boruta_features
[I 2024-10-23 16:42:15,497] Trial 0 finished with value: 0.22158941529393283 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 77, 'learning_rate': 0.1376303920077012, 'boosting_type': 'dart', 'min_child_samples': 42, 'min_child_weight': 0.025706201341357374, 'subsample': 0.7212125748952531, 'colsample_bytree': 0.845448028736891, 'reg_lambda': 0.7114604711726275, 'reg_alpha': 0.5264611330673967, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 78}. Best is trial 0 with value: 0.22158941529393283.
[I 2024-10-23 16:46:33,768] Trial 1 finished with value: 0.2351351626097232 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 5, 'num_leaves': 80, 'learning_rate': 0.2204897078016253, 'boosting_type': 'dart', 'min_child_samples': 98, 'min_child_weight': 0.21317550217208076, 'subsample': 0.8213259238637353, 'colsample_bytree': 0.7898302629863433, 'reg_lambda': 0.30281874687342597, 'reg_alpha': 0.7031568672034261, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 90}. Best is trial 1 with value: 0.2351351626097232.
[I 2024-10-23 16:50:13,568] Trial 2 finished with value: 0.20046011219796397 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 37, 'learning_rate': 0.2921819484473355, 'boosting_type': 'dart', 'min_child_samples': 55, 'min_child_weight': 0.036671632089617025, 'subsample': 0.853650786932557, 'colsample_bytree': 0.8595846794229967, 'reg_lambda': 0.5896334785603745, 'reg_alpha': 0.44443686758197776, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 75}. Best is trial 1 with value: 0.2351351626097232.
[I 2024-10-23 16:53:16,192] Trial 3 finished with value: 0.23679878016092668 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 79, 'learning_rate': 0.1477973664871458, 'boosting_type': 'dart', 'min_child_samples': 66, 'min_child_weight': 0.5848943222502431, 'subsample': 0.7578562280665435, 'colsample_bytree': 0.8569013714174842, 'reg_lambda': 0.7830582910962314, 'reg_alpha': 0.47600684644099023, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 69}. Best is trial 3 with value: 0.23679878016092668.
[I 2024-10-23 16:58:54,234] Trial 4 finished with value: 0.19816187937138333 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 35, 'learning_rate': 0.21928653738419931, 'boosting_type': 'dart', 'min_child_samples': 91, 'min_child_weight': 0.6094986845757975, 'subsample': 0.7401054887766454, 'colsample_bytree': 0.8004790468730579, 'reg_lambda': 0.9058436600151478, 'reg_alpha': 0.3303288382494436, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 95}. Best is trial 3 with value: 0.23679878016092668.
[I 2024-10-23 17:00:07,368] A new study created in memory with name: no-name-f7a3d15b-333d-4efb-88f5-8c0a43d8d060
Running LGBM with - MI_features
[I 2024-10-23 17:04:42,174] Trial 0 finished with value: 0.24095179928752009 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 77, 'learning_rate': 0.1376303920077012, 'boosting_type': 'dart', 'min_child_samples': 42, 'min_child_weight': 0.025706201341357374, 'subsample': 0.7212125748952531, 'colsample_bytree': 0.845448028736891, 'reg_lambda': 0.7114604711726275, 'reg_alpha': 0.5264611330673967, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 78}. Best is trial 0 with value: 0.24095179928752009.
[I 2024-10-23 17:09:11,312] Trial 1 finished with value: 0.2591430714091621 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 5, 'num_leaves': 80, 'learning_rate': 0.2204897078016253, 'boosting_type': 'dart', 'min_child_samples': 98, 'min_child_weight': 0.21317550217208076, 'subsample': 0.8213259238637353, 'colsample_bytree': 0.7898302629863433, 'reg_lambda': 0.30281874687342597, 'reg_alpha': 0.7031568672034261, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 90}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:12:51,626] Trial 2 finished with value: 0.22137292794654048 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 37, 'learning_rate': 0.2921819484473355, 'boosting_type': 'dart', 'min_child_samples': 55, 'min_child_weight': 0.036671632089617025, 'subsample': 0.853650786932557, 'colsample_bytree': 0.8595846794229967, 'reg_lambda': 0.5896334785603745, 'reg_alpha': 0.44443686758197776, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 75}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:16:04,539] Trial 3 finished with value: 0.25893959322359505 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 79, 'learning_rate': 0.1477973664871458, 'boosting_type': 'dart', 'min_child_samples': 66, 'min_child_weight': 0.5848943222502431, 'subsample': 0.7578562280665435, 'colsample_bytree': 0.8569013714174842, 'reg_lambda': 0.7830582910962314, 'reg_alpha': 0.47600684644099023, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 69}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:21:44,786] Trial 4 finished with value: 0.22016438352603665 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 35, 'learning_rate': 0.21928653738419931, 'boosting_type': 'dart', 'min_child_samples': 91, 'min_child_weight': 0.6094986845757975, 'subsample': 0.7401054887766454, 'colsample_bytree': 0.8004790468730579, 'reg_lambda': 0.9058436600151478, 'reg_alpha': 0.3303288382494436, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 95}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:23:32,857] A new study created in memory with name: no-name-dd58db87-07d0-4590-8bda-cafa57b5bf1f
----------------------------------------------------

Running CAT with - all_features
[I 2024-10-23 17:25:28,195] Trial 0 finished with value: 0.3748505945661208 and parameters: {'iterations': 84, 'depth': 8, 'learning_rate': 0.12296210782212309, 'l2_leaf_reg': 0.5337047810939617, 'bootstrap_type': 'Bayesian', 'random_strength': 7.299677422476102, 'border_count': 87, 'random_seed': 2024, 'bagging_temperature': 0.4738457034082185}. Best is trial 0 with value: 0.3748505945661208.
[I 2024-10-23 17:26:05,876] Trial 1 finished with value: 0.3626357377698067 and parameters: {'iterations': 78, 'depth': 5, 'learning_rate': 0.2286023354617474, 'l2_leaf_reg': 6.064240536180453, 'bootstrap_type': 'Bayesian', 'random_strength': 6.105633231254895, 'border_count': 58, 'random_seed': 2024, 'bagging_temperature': 0.22535416319269552}. Best is trial 0 with value: 0.3748505945661208.
[I 2024-10-23 17:28:04,527] Trial 2 finished with value: 0.33623335450939446 and parameters: {'iterations': 87, 'depth': 8, 'learning_rate': 0.1327685470388916, 'l2_leaf_reg': 1.045867323217618, 'bootstrap_type': 'Bayesian', 'random_strength': 2.893434682492068, 'border_count': 98, 'random_seed': 2024, 'bagging_temperature': 0.7979233971149834}. Best is trial 0 with value: 0.3748505945661208.
[I 2024-10-23 17:28:48,849] Trial 3 finished with value: 0.3759487722415203 and parameters: {'iterations': 82, 'depth': 6, 'learning_rate': 0.152087590757984, 'l2_leaf_reg': 2.929691145924111, 'bootstrap_type': 'Bayesian', 'random_strength': 4.433444876033651, 'border_count': 113, 'random_seed': 2024, 'bagging_temperature': 0.2892811403327177}. Best is trial 3 with value: 0.3759487722415203.
[I 2024-10-23 17:30:58,296] Trial 4 finished with value: 0.3276478659520506 and parameters: {'iterations': 92, 'depth': 8, 'learning_rate': 0.15824656330287373, 'l2_leaf_reg': 2.335110799201222, 'bootstrap_type': 'Bayesian', 'random_strength': 6.0046836005178665, 'border_count': 107, 'random_seed': 2024, 'bagging_temperature': 0.8924863863290551}. Best is trial 3 with value: 0.3759487722415203.
0:	learn: 1.1325150	total: 402ms	remaining: 32.6s
1:	learn: 0.9737351	total: 759ms	remaining: 30.3s
2:	learn: 0.8614875	total: 1.05s	remaining: 27.8s
3:	learn: 0.7789055	total: 1.31s	remaining: 25.5s
4:	learn: 0.7149616	total: 1.57s	remaining: 24.1s
5:	learn: 0.6607801	total: 1.83s	remaining: 23.2s
6:	learn: 0.6175241	total: 2.04s	remaining: 21.9s
7:	learn: 0.5845120	total: 2.33s	remaining: 21.5s
8:	learn: 0.5539147	total: 2.59s	remaining: 21s
9:	learn: 0.5277306	total: 2.93s	remaining: 21.1s
10:	learn: 0.5054359	total: 3.24s	remaining: 20.9s
11:	learn: 0.4863717	total: 3.49s	remaining: 20.3s
12:	learn: 0.4696365	total: 3.69s	remaining: 19.6s
13:	learn: 0.4546255	total: 3.96s	remaining: 19.2s
14:	learn: 0.4438541	total: 4.21s	remaining: 18.8s
15:	learn: 0.4286878	total: 4.44s	remaining: 18.3s
16:	learn: 0.4176373	total: 4.67s	remaining: 17.9s
17:	learn: 0.4095670	total: 4.86s	remaining: 17.3s
18:	learn: 0.4035267	total: 5.08s	remaining: 16.8s
19:	learn: 0.3965299	total: 5.36s	remaining: 16.6s
20:	learn: 0.3881775	total: 5.65s	remaining: 16.4s
21:	learn: 0.3801253	total: 5.98s	remaining: 16.3s
22:	learn: 0.3758483	total: 6.25s	remaining: 16s
23:	learn: 0.3705588	total: 6.52s	remaining: 15.8s
24:	learn: 0.3638557	total: 6.81s	remaining: 15.5s
25:	learn: 0.3583066	total: 7.02s	remaining: 15.1s
26:	learn: 0.3545156	total: 7.19s	remaining: 14.6s
27:	learn: 0.3506168	total: 7.44s	remaining: 14.3s
28:	learn: 0.3482962	total: 7.67s	remaining: 14s
29:	learn: 0.3425738	total: 7.94s	remaining: 13.8s
30:	learn: 0.3369936	total: 8.2s	remaining: 13.5s
31:	learn: 0.3339771	total: 8.42s	remaining: 13.2s
32:	learn: 0.3303205	total: 8.65s	remaining: 12.8s
33:	learn: 0.3276974	total: 8.89s	remaining: 12.5s
34:	learn: 0.3249905	total: 9.09s	remaining: 12.2s
35:	learn: 0.3228160	total: 9.29s	remaining: 11.9s
36:	learn: 0.3195114	total: 9.5s	remaining: 11.6s
37:	learn: 0.3166795	total: 9.75s	remaining: 11.3s
38:	learn: 0.3127664	total: 10s	remaining: 11s
39:	learn: 0.3105132	total: 10.2s	remaining: 10.8s
40:	learn: 0.3082892	total: 10.5s	remaining: 10.5s
41:	learn: 0.3060667	total: 10.8s	remaining: 10.3s
42:	learn: 0.3026135	total: 11.1s	remaining: 10.1s
43:	learn: 0.2997614	total: 11.5s	remaining: 9.96s
44:	learn: 0.2969568	total: 11.8s	remaining: 9.73s
45:	learn: 0.2957830	total: 12.1s	remaining: 9.46s
46:	learn: 0.2947311	total: 12.3s	remaining: 9.13s
47:	learn: 0.2929072	total: 12.5s	remaining: 8.82s
48:	learn: 0.2914847	total: 12.7s	remaining: 8.55s
49:	learn: 0.2891016	total: 13s	remaining: 8.31s
50:	learn: 0.2881824	total: 13.2s	remaining: 8.01s
51:	learn: 0.2868008	total: 13.4s	remaining: 7.72s
52:	learn: 0.2856366	total: 13.6s	remaining: 7.43s
53:	learn: 0.2840025	total: 13.8s	remaining: 7.18s
54:	learn: 0.2827444	total: 14s	remaining: 6.89s
55:	learn: 0.2818614	total: 14.3s	remaining: 6.63s
56:	learn: 0.2798854	total: 14.5s	remaining: 6.36s
57:	learn: 0.2790643	total: 14.7s	remaining: 6.09s
58:	learn: 0.2778777	total: 14.9s	remaining: 5.82s
59:	learn: 0.2765767	total: 15.2s	remaining: 5.55s
60:	learn: 0.2741460	total: 15.4s	remaining: 5.32s
61:	learn: 0.2732513	total: 15.6s	remaining: 5.04s
62:	learn: 0.2716021	total: 15.9s	remaining: 4.79s
63:	learn: 0.2693366	total: 16.1s	remaining: 4.53s
64:	learn: 0.2683629	total: 16.3s	remaining: 4.26s
65:	learn: 0.2674920	total: 16.5s	remaining: 4s
66:	learn: 0.2657599	total: 16.8s	remaining: 3.75s
67:	learn: 0.2648485	total: 17s	remaining: 3.51s
68:	learn: 0.2638311	total: 17.3s	remaining: 3.25s
69:	learn: 0.2628448	total: 17.5s	remaining: 3s
70:	learn: 0.2623515	total: 17.8s	remaining: 2.75s
71:	learn: 0.2612747	total: 18s	remaining: 2.5s
72:	learn: 0.2602216	total: 18.2s	remaining: 2.25s
73:	learn: 0.2592721	total: 18.4s	remaining: 1.99s
74:	learn: 0.2583134	total: 18.7s	remaining: 1.74s
75:	learn: 0.2570396	total: 18.9s	remaining: 1.49s
76:	learn: 0.2562873	total: 19.2s	remaining: 1.24s
77:	learn: 0.2559071	total: 19.4s	remaining: 993ms
78:	learn: 0.2548139	total: 19.6s	remaining: 743ms
79:	learn: 0.2533138	total: 19.8s	remaining: 495ms
80:	learn: 0.2520942	total: 20.1s	remaining: 248ms
81:	learn: 0.2512924	total: 20.4s	remaining: 0us
[I 2024-10-23 17:31:20,269] A new study created in memory with name: no-name-bd296dc2-4441-4249-a319-9c5e0bbc9120
Running CAT with - Boruta_features
[I 2024-10-23 17:32:49,271] Trial 0 finished with value: 0.38440707201648483 and parameters: {'iterations': 84, 'depth': 8, 'learning_rate': 0.12296210782212309, 'l2_leaf_reg': 0.5337047810939617, 'bootstrap_type': 'Bayesian', 'random_strength': 7.299677422476102, 'border_count': 87, 'random_seed': 2024, 'bagging_temperature': 0.4738457034082185}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:33:23,262] Trial 1 finished with value: 0.3678413147143959 and parameters: {'iterations': 78, 'depth': 5, 'learning_rate': 0.2286023354617474, 'l2_leaf_reg': 6.064240536180453, 'bootstrap_type': 'Bayesian', 'random_strength': 6.105633231254895, 'border_count': 58, 'random_seed': 2024, 'bagging_temperature': 0.22535416319269552}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:34:57,372] Trial 2 finished with value: 0.34560453284100395 and parameters: {'iterations': 87, 'depth': 8, 'learning_rate': 0.1327685470388916, 'l2_leaf_reg': 1.045867323217618, 'bootstrap_type': 'Bayesian', 'random_strength': 2.893434682492068, 'border_count': 98, 'random_seed': 2024, 'bagging_temperature': 0.7979233971149834}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:35:36,595] Trial 3 finished with value: 0.3812228767182266 and parameters: {'iterations': 82, 'depth': 6, 'learning_rate': 0.152087590757984, 'l2_leaf_reg': 2.929691145924111, 'bootstrap_type': 'Bayesian', 'random_strength': 4.433444876033651, 'border_count': 113, 'random_seed': 2024, 'bagging_temperature': 0.2892811403327177}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:37:57,313] Trial 4 finished with value: 0.3361079064881016 and parameters: {'iterations': 92, 'depth': 8, 'learning_rate': 0.15824656330287373, 'l2_leaf_reg': 2.335110799201222, 'bootstrap_type': 'Bayesian', 'random_strength': 6.0046836005178665, 'border_count': 107, 'random_seed': 2024, 'bagging_temperature': 0.8924863863290551}. Best is trial 0 with value: 0.38440707201648483.
0:	learn: 1.1677280	total: 538ms	remaining: 44.6s
1:	learn: 1.0233258	total: 1.12s	remaining: 45.8s
2:	learn: 0.9224457	total: 1.6s	remaining: 43.1s
3:	learn: 0.8352970	total: 2.22s	remaining: 44.3s
4:	learn: 0.7686503	total: 2.86s	remaining: 45.2s
5:	learn: 0.7108422	total: 3.61s	remaining: 47s
6:	learn: 0.6622692	total: 4.23s	remaining: 46.5s
7:	learn: 0.6175089	total: 4.81s	remaining: 45.7s
8:	learn: 0.5833032	total: 5.4s	remaining: 45s
9:	learn: 0.5526761	total: 5.99s	remaining: 44.3s
10:	learn: 0.5284835	total: 6.58s	remaining: 43.7s
11:	learn: 0.5088620	total: 7.24s	remaining: 43.4s
12:	learn: 0.4912928	total: 7.85s	remaining: 42.9s
13:	learn: 0.4739903	total: 8.46s	remaining: 42.3s
14:	learn: 0.4575285	total: 9.04s	remaining: 41.6s
15:	learn: 0.4465080	total: 9.62s	remaining: 40.9s
16:	learn: 0.4337611	total: 10.3s	remaining: 40.5s
17:	learn: 0.4218666	total: 10.8s	remaining: 39.7s
18:	learn: 0.4107497	total: 11.4s	remaining: 39s
19:	learn: 0.4018927	total: 12s	remaining: 38.4s
20:	learn: 0.3935956	total: 12.6s	remaining: 37.7s
21:	learn: 0.3854994	total: 13.1s	remaining: 37s
22:	learn: 0.3794979	total: 13.7s	remaining: 36.3s
23:	learn: 0.3720972	total: 14.3s	remaining: 35.7s
24:	learn: 0.3636630	total: 15s	remaining: 35.4s
25:	learn: 0.3577149	total: 15.7s	remaining: 35s
26:	learn: 0.3547073	total: 16.4s	remaining: 34.6s
27:	learn: 0.3498429	total: 17.1s	remaining: 34.1s
28:	learn: 0.3454957	total: 17.8s	remaining: 33.8s
29:	learn: 0.3425081	total: 18.5s	remaining: 33.4s
30:	learn: 0.3397684	total: 19.4s	remaining: 33.2s
31:	learn: 0.3341317	total: 20.5s	remaining: 33.3s
32:	learn: 0.3300049	total: 21.2s	remaining: 32.8s
33:	learn: 0.3275224	total: 21.8s	remaining: 32.1s
34:	learn: 0.3238664	total: 22.4s	remaining: 31.4s
35:	learn: 0.3202316	total: 23s	remaining: 30.7s
36:	learn: 0.3179103	total: 23.6s	remaining: 29.9s
37:	learn: 0.3154047	total: 24.1s	remaining: 29.2s
38:	learn: 0.3135254	total: 24.7s	remaining: 28.5s
39:	learn: 0.3118519	total: 25.4s	remaining: 27.9s
40:	learn: 0.3091185	total: 26.1s	remaining: 27.3s
41:	learn: 0.3048808	total: 26.8s	remaining: 26.8s
42:	learn: 0.3029744	total: 27.5s	remaining: 26.2s
43:	learn: 0.3002069	total: 28.2s	remaining: 25.6s
44:	learn: 0.2963878	total: 29s	remaining: 25.1s
45:	learn: 0.2955125	total: 29.5s	remaining: 24.4s
46:	learn: 0.2950047	total: 29.9s	remaining: 23.5s
47:	learn: 0.2940622	total: 30.5s	remaining: 22.9s
48:	learn: 0.2924197	total: 31.1s	remaining: 22.2s
49:	learn: 0.2911538	total: 31.6s	remaining: 21.5s
50:	learn: 0.2891470	total: 32.2s	remaining: 20.8s
51:	learn: 0.2879441	total: 32.8s	remaining: 20.2s
52:	learn: 0.2861317	total: 33.4s	remaining: 19.5s
53:	learn: 0.2854343	total: 34s	remaining: 18.9s
54:	learn: 0.2839754	total: 34.5s	remaining: 18.2s
55:	learn: 0.2826004	total: 35.1s	remaining: 17.6s
56:	learn: 0.2809507	total: 36.3s	remaining: 17.2s
57:	learn: 0.2801611	total: 37.2s	remaining: 16.7s
58:	learn: 0.2778727	total: 37.9s	remaining: 16.1s
59:	learn: 0.2769787	total: 38.5s	remaining: 15.4s
60:	learn: 0.2754694	total: 39.3s	remaining: 14.8s
61:	learn: 0.2744866	total: 40.2s	remaining: 14.3s
62:	learn: 0.2733779	total: 40.8s	remaining: 13.6s
63:	learn: 0.2723974	total: 41.3s	remaining: 12.9s
64:	learn: 0.2711374	total: 42s	remaining: 12.3s
65:	learn: 0.2692392	total: 42.9s	remaining: 11.7s
66:	learn: 0.2679932	total: 43.5s	remaining: 11s
67:	learn: 0.2675762	total: 44.1s	remaining: 10.4s
68:	learn: 0.2664956	total: 44.6s	remaining: 9.7s
69:	learn: 0.2654207	total: 45.4s	remaining: 9.08s
70:	learn: 0.2644709	total: 46.2s	remaining: 8.46s
71:	learn: 0.2639310	total: 46.9s	remaining: 7.82s
72:	learn: 0.2632427	total: 47.5s	remaining: 7.16s
73:	learn: 0.2621599	total: 48.3s	remaining: 6.53s
74:	learn: 0.2612547	total: 49.2s	remaining: 5.9s
75:	learn: 0.2606603	total: 49.8s	remaining: 5.24s
76:	learn: 0.2602171	total: 50.4s	remaining: 4.58s
77:	learn: 0.2593541	total: 51.1s	remaining: 3.93s
78:	learn: 0.2586285	total: 52s	remaining: 3.29s
79:	learn: 0.2567037	total: 53.5s	remaining: 2.67s
80:	learn: 0.2559769	total: 54.5s	remaining: 2.02s
81:	learn: 0.2548979	total: 55.4s	remaining: 1.35s
82:	learn: 0.2541284	total: 56s	remaining: 675ms
83:	learn: 0.2536267	total: 56.6s	remaining: 0us
[I 2024-10-23 17:38:55,136] A new study created in memory with name: no-name-acf218df-fda9-43c9-9df2-3053e8ce69b7
Running CAT with - MI_features
[I 2024-10-23 17:40:48,543] Trial 0 finished with value: 0.40825074006423584 and parameters: {'iterations': 84, 'depth': 8, 'learning_rate': 0.12296210782212309, 'l2_leaf_reg': 0.5337047810939617, 'bootstrap_type': 'Bayesian', 'random_strength': 7.299677422476102, 'border_count': 87, 'random_seed': 2024, 'bagging_temperature': 0.4738457034082185}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:41:32,099] Trial 1 finished with value: 0.39577163823975137 and parameters: {'iterations': 78, 'depth': 5, 'learning_rate': 0.2286023354617474, 'l2_leaf_reg': 6.064240536180453, 'bootstrap_type': 'Bayesian', 'random_strength': 6.105633231254895, 'border_count': 58, 'random_seed': 2024, 'bagging_temperature': 0.22535416319269552}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:43:22,633] Trial 2 finished with value: 0.36876867748199216 and parameters: {'iterations': 87, 'depth': 8, 'learning_rate': 0.1327685470388916, 'l2_leaf_reg': 1.045867323217618, 'bootstrap_type': 'Bayesian', 'random_strength': 2.893434682492068, 'border_count': 98, 'random_seed': 2024, 'bagging_temperature': 0.7979233971149834}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:44:05,584] Trial 3 finished with value: 0.40728893721284903 and parameters: {'iterations': 82, 'depth': 6, 'learning_rate': 0.152087590757984, 'l2_leaf_reg': 2.929691145924111, 'bootstrap_type': 'Bayesian', 'random_strength': 4.433444876033651, 'border_count': 113, 'random_seed': 2024, 'bagging_temperature': 0.2892811403327177}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:46:10,647] Trial 4 finished with value: 0.3612604859095497 and parameters: {'iterations': 92, 'depth': 8, 'learning_rate': 0.15824656330287373, 'l2_leaf_reg': 2.335110799201222, 'bootstrap_type': 'Bayesian', 'random_strength': 6.0046836005178665, 'border_count': 107, 'random_seed': 2024, 'bagging_temperature': 0.8924863863290551}. Best is trial 0 with value: 0.40825074006423584.
0:	learn: 1.1730609	total: 486ms	remaining: 40.3s
1:	learn: 1.0342067	total: 949ms	remaining: 38.9s
2:	learn: 0.9247563	total: 1.48s	remaining: 39.9s
3:	learn: 0.8457534	total: 2.09s	remaining: 41.9s
4:	learn: 0.7796354	total: 2.66s	remaining: 42.1s
5:	learn: 0.7290371	total: 3.19s	remaining: 41.5s
6:	learn: 0.6850044	total: 3.71s	remaining: 40.8s
7:	learn: 0.6478551	total: 4.17s	remaining: 39.6s
8:	learn: 0.6138447	total: 4.65s	remaining: 38.8s
9:	learn: 0.5842388	total: 5.19s	remaining: 38.4s
10:	learn: 0.5664373	total: 5.55s	remaining: 36.8s
11:	learn: 0.5420530	total: 6.08s	remaining: 36.5s
12:	learn: 0.5215885	total: 6.54s	remaining: 35.7s
13:	learn: 0.5042014	total: 7.01s	remaining: 35.1s
14:	learn: 0.4867716	total: 7.5s	remaining: 34.5s
15:	learn: 0.4715421	total: 7.98s	remaining: 33.9s
16:	learn: 0.4562622	total: 8.51s	remaining: 33.5s
17:	learn: 0.4474551	total: 9s	remaining: 33s
18:	learn: 0.4359403	total: 9.46s	remaining: 32.4s
19:	learn: 0.4272344	total: 9.95s	remaining: 31.8s
20:	learn: 0.4190076	total: 10.4s	remaining: 31.3s
21:	learn: 0.4132334	total: 10.9s	remaining: 30.8s
22:	learn: 0.4021469	total: 11.4s	remaining: 30.2s
23:	learn: 0.3935058	total: 11.9s	remaining: 29.7s
24:	learn: 0.3867254	total: 12.4s	remaining: 29.4s
25:	learn: 0.3808787	total: 12.9s	remaining: 28.8s
26:	learn: 0.3756339	total: 13.4s	remaining: 28.3s
27:	learn: 0.3694744	total: 13.9s	remaining: 27.8s
28:	learn: 0.3642195	total: 14.4s	remaining: 27.3s
29:	learn: 0.3631619	total: 14.8s	remaining: 26.7s
30:	learn: 0.3584025	total: 15.4s	remaining: 26.3s
31:	learn: 0.3536884	total: 15.9s	remaining: 25.9s
32:	learn: 0.3511280	total: 16.4s	remaining: 25.3s
33:	learn: 0.3466364	total: 16.8s	remaining: 24.8s
34:	learn: 0.3433440	total: 17.3s	remaining: 24.2s
35:	learn: 0.3408153	total: 17.8s	remaining: 23.7s
36:	learn: 0.3391054	total: 18.3s	remaining: 23.2s
37:	learn: 0.3379596	total: 18.8s	remaining: 22.7s
38:	learn: 0.3350491	total: 19.3s	remaining: 22.2s
39:	learn: 0.3331631	total: 19.8s	remaining: 21.7s
40:	learn: 0.3311541	total: 20.2s	remaining: 21.2s
41:	learn: 0.3282978	total: 20.7s	remaining: 20.7s
42:	learn: 0.3250999	total: 21.2s	remaining: 20.2s
43:	learn: 0.3233945	total: 21.7s	remaining: 19.7s
44:	learn: 0.3206411	total: 22.2s	remaining: 19.3s
45:	learn: 0.3197494	total: 22.8s	remaining: 18.8s
46:	learn: 0.3182101	total: 23.3s	remaining: 18.3s
47:	learn: 0.3168213	total: 23.8s	remaining: 17.8s
48:	learn: 0.3155018	total: 24.3s	remaining: 17.3s
49:	learn: 0.3131452	total: 24.7s	remaining: 16.8s
50:	learn: 0.3119475	total: 25.2s	remaining: 16.3s
51:	learn: 0.3106635	total: 25.7s	remaining: 15.8s
52:	learn: 0.3095831	total: 26.4s	remaining: 15.4s
53:	learn: 0.3079553	total: 27.1s	remaining: 15.1s
54:	learn: 0.3077176	total: 27.7s	remaining: 14.6s
55:	learn: 0.3069242	total: 28.2s	remaining: 14.1s
56:	learn: 0.3058729	total: 28.7s	remaining: 13.6s
57:	learn: 0.3046885	total: 29.3s	remaining: 13.1s
58:	learn: 0.3041026	total: 29.8s	remaining: 12.6s
59:	learn: 0.3028448	total: 30.3s	remaining: 12.1s
60:	learn: 0.3016953	total: 30.9s	remaining: 11.7s
61:	learn: 0.3008269	total: 31.5s	remaining: 11.2s
62:	learn: 0.3002449	total: 32.1s	remaining: 10.7s
63:	learn: 0.2995859	total: 32.6s	remaining: 10.2s
64:	learn: 0.2980209	total: 33.1s	remaining: 9.69s
65:	learn: 0.2972820	total: 33.6s	remaining: 9.17s
66:	learn: 0.2953970	total: 34.1s	remaining: 8.65s
67:	learn: 0.2943505	total: 34.6s	remaining: 8.13s
68:	learn: 0.2938072	total: 35s	remaining: 7.62s
69:	learn: 0.2927706	total: 35.6s	remaining: 7.11s
70:	learn: 0.2908826	total: 36s	remaining: 6.6s
71:	learn: 0.2886229	total: 36.6s	remaining: 6.1s
72:	learn: 0.2878116	total: 37.2s	remaining: 5.6s
73:	learn: 0.2859188	total: 37.7s	remaining: 5.09s
74:	learn: 0.2843453	total: 38.2s	remaining: 4.58s
75:	learn: 0.2839863	total: 38.8s	remaining: 4.09s
76:	learn: 0.2824572	total: 39.5s	remaining: 3.6s
77:	learn: 0.2815902	total: 40.1s	remaining: 3.08s
78:	learn: 0.2812899	total: 40.6s	remaining: 2.57s
79:	learn: 0.2800560	total: 41.1s	remaining: 2.06s
80:	learn: 0.2785200	total: 41.6s	remaining: 1.54s
81:	learn: 0.2781103	total: 42.4s	remaining: 1.03s
82:	learn: 0.2775323	total: 43.1s	remaining: 519ms
83:	learn: 0.2771441	total: 43.6s	remaining: 0us

----------------------------------------------------

CPU times: total: 1d 5h 8min 44s
Wall time: 8h 15min 7s
In [548]:
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'TimeElapsed', 'Optuna_TimeElapsed']
for metric in metrics:
    data = []    
    for technique in feature_set_result:
        for model in feature_set_result[technique]:
            data.append({
                'Feature_Set': technique,
                'Model': model,
                'Score': feature_set_result[technique][model][metric]
            })
    metric_df = pd.DataFrame(data)

    plt.figure(figsize=(10, 6))
    sns.barplot(x='Feature_Set', y='Score', hue='Model', data=metric_df, palette='Set2')

    if metric == 'TimeElapsed':
        plt.title(f'Comparison of Time Elapsed Across Models and Feature Sets')
        plt.ylabel('Time Elapsed (seconds)')
    elif metric == 'Optuna_TimeElapsed':
        plt.title(f'Comparison of Optuna Time Elapsed Across Models and Feature Sets')
        plt.ylabel('Optuna Time Elapsed (seconds)')        
    else:
        plt.title(f'Comparison of {metric} Across Models and Feature Sets')
        plt.ylabel(f'{metric.capitalize()} Score')
        
    plt.xlabel('Feature Selection Technique')

    for p in plt.gca().patches:
        plt.gca().annotate(f'{p.get_height():.2f}', 
                            (p.get_x() + p.get_width() / 2., p.get_height()), 
                           ha='center', va='baseline', fontsize=12, color='black', 
                           xytext=(0, 5), textcoords='offset points')

    plt.legend(title='Model')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

feature_importances¶


In [549]:
num_models = len(best_models)
num_techniques = len(feature_selection_techniques)
total_plots = num_models * num_techniques

rows = (total_plots // 3) + (total_plots % 3 > 0)
cols = min(total_plots, 3)

fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(18, 12)) 

ax = ax.flatten()

plot_idx = 0

for model in best_models:
    for technique in feature_selection_techniques:
        features = feature_selection_techniques[technique]  
        importances_score = best_models[model][technique].feature_importances_
    
        df_feature_importances = pd.Series(index=features, data=importances_score)
        df_feature_importances = df_feature_importances.sort_values(ascending=False)
        
        sns.barplot(x=df_feature_importances.index, y=df_feature_importances.values, ax=ax[plot_idx])
        
        ax[plot_idx].set_title(f"Feature Importance - {model} - {technique}")
        ax[plot_idx].set_xlabel('Features')
        ax[plot_idx].set_ylabel('Importance')
        ax[plot_idx].tick_params(axis='x', rotation=90) 
        
        plot_idx += 1  

plt.tight_layout()
plt.show()
No description has been provided for this image

Summary of Models Evaluation Results¶

XGBoost¶

XGBoost consistently achieved high performance across different feature sets, with accuracy, precision, recall, and F1-score all exceeding 0.9 on the validation data. This indicates its strong ability to generalize and handle the classification task effectively. While XGBoost required more time to train compared to other models, the performance gains make it the best option.

LightGBM¶

LightGBM also performed well, delivering solid results with high accuracy and recall. Its shorter training time makes it a fast alternative, though it did not quite reach the same level of performance as XGBoost.

CatBoost¶

CatBoost had the shortest training time, but its overall performance lagged behind both XGBoost and LightGBM, making it the least favorable model for this task.


Conclusion¶

XGBoost with Boruta features proved to be the most reliable and effective model, surpassing 0.9 in accuracy, precision, recall, and F1-score across the validation data. Given these results, we will proceed to retrain XGBoost with Boruta features using Optuna, increasing the number of trials (20) to further fine-tune and optimize its performance.

In [570]:
%%time
training_time = {}
trials = 20
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE))
start_time = time.time()
study.optimize(lambda trial: objective(trial, X_train[boruta_selected_features], y_train, xgb.XGBClassifier()), n_trials=trials, n_jobs=1)
end_time = time.time()
training_time['Optuna'] = end_time - start_time

best_params   = study.best_trial.params        
best_model    = xgb.XGBClassifier(**best_params)
sample_weight = compute_sample_weight('balanced', y_train)
        
start_time = time.time()
best_model.fit(X_train[boruta_selected_features], y_train, sample_weight=sample_weight)
end_time = time.time()
training_time[ "Fitting model"] = end_time - start_time

y_pred = best_model.predict(X_test[boruta_selected_features])
[I 2024-10-24 17:17:04,056] A new study created in memory with name: no-name-c66d1984-342c-4086-9688-0f70ad9277bc
[I 2024-10-24 17:31:03,085] Trial 0 finished with value: 0.38226456908590534 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.38226456908590534.
[I 2024-10-24 17:54:58,355] Trial 1 finished with value: 0.9881188472914543 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 18:13:13,950] Trial 2 finished with value: 0.9858997093684756 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 18:26:11,550] Trial 3 finished with value: 0.2650465790535372 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 18:41:03,595] Trial 4 finished with value: 0.2909791570453547 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 19:09:22,592] Trial 5 finished with value: 0.25456705874655905 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 88, 'max_depth': 9, 'learning_rate': 0.16228573177239408, 'subsample': 0.8010461081698292, 'colsample_bynode': 0.8698007575766158, 'colsample_bylevel': 0.7587031265395159, 'colsample_bytree': 0.8354239101309623, 'reg_lambda': 0.4788157619774831, 'reg_alpha': 0.7135414422664076, 'min_child_weight': 1.0530699737162201, 'max_delta_step': 1.8724942460821676, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 19:29:44,313] Trial 6 finished with value: 0.9878445459689275 and parameters: {'eval_metric': 'auc', 'n_estimators': 80, 'max_depth': 6, 'learning_rate': 0.2575041621200346, 'subsample': 0.7337738615129971, 'colsample_bynode': 0.8172717227979691, 'colsample_bylevel': 0.7862421341431735, 'colsample_bytree': 0.7123820370391638, 'reg_lambda': 0.3605092972246944, 'reg_alpha': 0.7607308585115216, 'min_child_weight': 1.2216386373611865, 'max_delta_step': 1.475995285326252, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 19:45:35,667] Trial 7 finished with value: 0.3024047377452335 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 73, 'max_depth': 5, 'learning_rate': 0.24233284318934117, 'subsample': 0.8745417286181747, 'colsample_bynode': 0.8186312730850103, 'colsample_bylevel': 0.8389425757772428, 'colsample_bytree': 0.7346466640586605, 'reg_lambda': 0.5785593330742154, 'reg_alpha': 0.8838627585910562, 'min_child_weight': 2.6027256701341583, 'max_delta_step': 2.930138850054874, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-24 20:09:27,971] Trial 8 finished with value: 0.9893716866292657 and parameters: {'eval_metric': 'auc', 'n_estimators': 85, 'max_depth': 7, 'learning_rate': 0.22332739921646455, 'subsample': 0.7279866486583308, 'colsample_bynode': 0.7822471632439039, 'colsample_bylevel': 0.8555260678875416, 'colsample_bytree': 0.8879451039217905, 'reg_lambda': 0.19412146736509212, 'reg_alpha': 0.9446339775786664, 'min_child_weight': 2.4934679292732183, 'max_delta_step': 1.327006796682491, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 8 with value: 0.9893716866292657.
[I 2024-10-24 20:26:16,472] Trial 9 finished with value: 0.9912147694546469 and parameters: {'eval_metric': 'auc', 'n_estimators': 67, 'max_depth': 9, 'learning_rate': 0.26401129744062757, 'subsample': 0.8473317755205418, 'colsample_bynode': 0.8693936461004659, 'colsample_bylevel': 0.8218231247532671, 'colsample_bytree': 0.7688466011255816, 'reg_lambda': 0.30670209229874446, 'reg_alpha': 0.9501632096075534, 'min_child_weight': 1.2293142773561567, 'max_delta_step': 1.5251114649517818, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 9 with value: 0.9912147694546469.
[I 2024-10-24 20:58:15,614] Trial 10 finished with value: 0.9909616701341819 and parameters: {'eval_metric': 'auc', 'n_estimators': 93, 'max_depth': 9, 'learning_rate': 0.2988163948036256, 'subsample': 0.7744075991243188, 'colsample_bynode': 0.8858058991322473, 'colsample_bylevel': 0.7027196288800658, 'colsample_bytree': 0.7626254307744855, 'reg_lambda': 0.7099610989707028, 'reg_alpha': 0.5373602864876568, 'min_child_weight': 1.8577642969950348, 'max_delta_step': 0.5129580393129478, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 9 with value: 0.9912147694546469.
[I 2024-10-24 21:33:28,591] Trial 11 finished with value: 0.9912221691037868 and parameters: {'eval_metric': 'auc', 'n_estimators': 98, 'max_depth': 9, 'learning_rate': 0.29487506519706436, 'subsample': 0.7810327727954102, 'colsample_bynode': 0.898120876774128, 'colsample_bylevel': 0.7176875810430865, 'colsample_bytree': 0.7630714493366861, 'reg_lambda': 0.7053144541694071, 'reg_alpha': 0.5136639848280791, 'min_child_weight': 1.915462286347568, 'max_delta_step': 0.5509085497851001, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-24 22:07:58,861] Trial 12 finished with value: 0.9903743193075278 and parameters: {'eval_metric': 'auc', 'n_estimators': 100, 'max_depth': 8, 'learning_rate': 0.199379578399092, 'subsample': 0.7865671896440775, 'colsample_bynode': 0.8998669596216515, 'colsample_bylevel': 0.7013775669664997, 'colsample_bytree': 0.8143973347495697, 'reg_lambda': 0.7475276287469181, 'reg_alpha': 0.5533626777339511, 'min_child_weight': 1.7950770624471184, 'max_delta_step': 0.769348905646069, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-24 22:45:39,736] Trial 13 finished with value: 0.9912158449453436 and parameters: {'eval_metric': 'auc', 'n_estimators': 100, 'max_depth': 8, 'learning_rate': 0.29236361038558345, 'subsample': 0.8317135778689105, 'colsample_bynode': 0.8653307844298959, 'colsample_bylevel': 0.7488044557291498, 'colsample_bytree': 0.7559319450726714, 'reg_lambda': 0.8300180530697104, 'reg_alpha': 0.5638257439440392, 'min_child_weight': 1.5184440048406134, 'max_delta_step': 1.0533040953204575, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-24 23:20:42,343] Trial 14 finished with value: 0.9910655071328601 and parameters: {'eval_metric': 'auc', 'n_estimators': 98, 'max_depth': 8, 'learning_rate': 0.29935544761584315, 'subsample': 0.7614200118462342, 'colsample_bynode': 0.8532107311224439, 'colsample_bylevel': 0.7398541239598725, 'colsample_bytree': 0.7443477858733684, 'reg_lambda': 0.9343818529072998, 'reg_alpha': 0.5630358087685485, 'min_child_weight': 2.0989075635633436, 'max_delta_step': 0.9690187706405073, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-24 23:52:52,318] Trial 15 finished with value: 0.9902917467792978 and parameters: {'eval_metric': 'auc', 'n_estimators': 93, 'max_depth': 8, 'learning_rate': 0.20375529038977508, 'subsample': 0.8264909873659759, 'colsample_bynode': 0.7061908049104075, 'colsample_bylevel': 0.7293990815637788, 'colsample_bytree': 0.7996337930861529, 'reg_lambda': 0.7911814717451944, 'reg_alpha': 0.3946423969940382, 'min_child_weight': 1.514977742714226, 'max_delta_step': 1.0402396964053904, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-25 00:25:55,300] Trial 16 finished with value: 0.9897669933211359 and parameters: {'eval_metric': 'auc', 'n_estimators': 96, 'max_depth': 8, 'learning_rate': 0.23022492604072695, 'subsample': 0.8143583682434862, 'colsample_bynode': 0.8971797874115994, 'colsample_bylevel': 0.7214688435847043, 'colsample_bytree': 0.7015448495143646, 'reg_lambda': 0.8517762135639828, 'reg_alpha': 0.4089027486623522, 'min_child_weight': 2.0478106560267944, 'max_delta_step': 0.5257675996988145, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-25 00:58:12,410] Trial 17 finished with value: 0.9899366324375621 and parameters: {'eval_metric': 'auc', 'n_estimators': 92, 'max_depth': 9, 'learning_rate': 0.1000604781764893, 'subsample': 0.7531881336011558, 'colsample_bynode': 0.7890525105838824, 'colsample_bylevel': 0.7634305451882369, 'colsample_bytree': 0.838646311730019, 'reg_lambda': 0.6321978979681996, 'reg_alpha': 0.632063136342995, 'min_child_weight': 2.9738999962143233, 'max_delta_step': 1.1353190836352227, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-25 01:32:55,754] Trial 18 finished with value: 0.9909682511428506 and parameters: {'eval_metric': 'auc', 'n_estimators': 100, 'max_depth': 8, 'learning_rate': 0.2846078484674374, 'subsample': 0.7988982500200076, 'colsample_bynode': 0.823377895365234, 'colsample_bylevel': 0.7182087460288243, 'colsample_bytree': 0.7470669008123725, 'reg_lambda': 0.8577366803218225, 'reg_alpha': 0.20932022345287155, 'min_child_weight': 1.522652889206766, 'max_delta_step': 0.7564213964787009, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
[I 2024-10-25 02:02:33,828] Trial 19 finished with value: 0.9910225704386083 and parameters: {'eval_metric': 'auc', 'n_estimators': 89, 'max_depth': 9, 'learning_rate': 0.19820314479285167, 'subsample': 0.8533803240374401, 'colsample_bynode': 0.8741891955562361, 'colsample_bylevel': 0.7741847956439751, 'colsample_bytree': 0.7763589961184837, 'reg_lambda': 0.6656043139922608, 'reg_alpha': 0.4277981853793561, 'min_child_weight': 0.846094423419322, 'max_delta_step': 0.8027908456686832, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
CPU times: total: 1d 13h 6min 48s
Wall time: 11h 35s
In [571]:
check_accuracy(y_test, y_pred, "XGB-model with boruta_selected_features and 20 n_trials")

sns.barplot(training_time, palette='Set2')
plt.title("Training  Time in seconds")
plt.show()
No description has been provided for this image
No description has been provided for this image

Model Comparison with Optuna Hyperparameter Tuning¶

After rerunning XGBoost with Boruta-selected features and Optuna hyperparameter optimization over 20 trials, the model produced excellent results:

  • Accuracy: 0.95
  • Precision: 0.92
  • Recall: 0.95
  • F1-Score: 0.93

Despite increasing trials from 5 to 20, performance improvements were marginal, and the extended tuning time reached 11 hours without substantial additional gains.

Confusion Matrix Insights¶

Analysis of the confusion matrix reveals that most misclassifications occur within the "phishing" class, particularly:

  • Phishing URLs are sometimes misclassified as "benign"
  • Phishing URLs are occasionally mislabeled as "malware"

These patterns suggest that while the model generally performs well, additional feature engineering or targeted optimization may be beneficial to improve its ability to distinguish phishing URLs from other classes, specifically benign URLs and malware.


This summary offers insights for selecting LightGBM for faster tuning without major performance compromises, while identifying opportunities for further improvements in phishing URL detection.

Interpretability with SHAP (SHapley Additive exPlanations)¶

What is SHAP?¶

SHAP is a unified framework for interpreting the output of machine learning models. It provides insight into how each feature in the model contributes to the final prediction, using the concept of Shapley values from cooperative game theory.

Shapley Values:¶

In game theory, Shapley values represent the fair distribution of a reward among players based on their contributions. In the context of machine learning, the “players” are the features of the model, and SHAP assigns each feature a value that reflects its contribution to the model’s prediction.

Why Use SHAP?¶

  1. Explainability: SHAP provides clear explanations of how features contribute to predictions, both globally (across all predictions) and locally (for individual predictions).

  2. Flexibility: SHAP works with various machine learning models (tree-based models, neural networks, etc.) and is model-agnostic, providing a unified framework for understanding feature importance.

  3. Feature Engineering:

    • SHAP can help uncover which features are truly important and which features might be redundant or irrelevant.
    • By identifying feature interactions and analyzing the importance of different feature combinations, SHAP can guide the creation of new, meaningful features or the removal of noisy ones.
    • It highlights features that have non-linear impacts on predictions, revealing areas where additional transformations or domain-specific knowledge could improve model performance.
  4. Trust and Compliance: By understanding why a model makes certain predictions, users can trust the model more. For regulated industries like healthcare and finance, SHAP provides explanations to meet transparency requirements.

  5. Debugging: SHAP can help identify when a model is relying on irrelevant or incorrect features, which can improve model debugging and refinement.

  6. Accuracy: SHAP’s foundation on Shapley values provides accurate and mathematically sound explanations, ensuring that the feature attributions are consistent and unbiased across models.

Why is Model Interpretability Important?¶

  1. Trust: By understanding why a model makes certain predictions, users can gain more trust in its decisions.
  2. Debugging: SHAP helps identify issues, such as when a model relies on irrelevant features or biases.
  3. Compliance: In industries where transparency is crucial (healthcare, finance, cyber security), SHAP provides insights to meet regulatory requirements.
  4. Model Insights: Understanding feature importance and interactions can provide valuable insights for improving models or business processes.

SHAP in Action¶

Key Benefits:¶
  • Model-Agnostic: SHAP works with any machine learning model (tree-based models, neural networks).
  • Visualizations: SHAP offers powerful visualization tools like force plots, summary plots, and dependence plots, making it easier to interpret the impact of features.
  • Feature Interactions: SHAP can also explain how features interact with each other to immethod for feature

Here we Gonna do SHAP on XGBoost with Boruta-selected features after 20 trials

In [574]:
%%time
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test[boruta_selected_features])
CPU times: total: 1h 33min 19s
Wall time: 23min 53s
In [575]:
for k, target_label in enumerate(encoding_map.keys()):
    print(f"SHAP Summary for '{target_label.capitalize()}'")
    print(f"--------------------------------------------------------------------------------------")
    shap_values_benign = shap_values[:, :, k]
    shap.summary_plot( shap_values_benign, X_test[boruta_selected_features], plot_size=(8,8), )
    plt.show()
    print("\n\n")
SHAP Summary for 'Benign'
--------------------------------------------------------------------------------------
No description has been provided for this image


SHAP Summary for 'Phishing'
--------------------------------------------------------------------------------------
No description has been provided for this image


SHAP Summary for 'Defacement'
--------------------------------------------------------------------------------------
No description has been provided for this image


SHAP Summary for 'Malware'
--------------------------------------------------------------------------------------
No description has been provided for this image


In [576]:
shap_values_mean = np.mean(np.abs(shap_values), axis=2)
print("Aggregated SHAP Summary for All Classes")
print(f"--------------------------------------------------------------------------------------")
shap.summary_plot( shap_values_mean, X_test[boruta_selected_features] , plot_size=(10,8), plot_type='bar')
Aggregated SHAP Summary for All Classes
--------------------------------------------------------------------------------------
No description has been provided for this image

Conclusion on SHAP Analysis¶

Through the SHAP Summary and Aggregated SHAP Summary, we observe that the SHAP values align closely with the model's feature importance rankings. This alignment indicates that all the selected features meaningfully contribute to the model's performance.

In the SHAP Summary, it becomes clear that each class has its own set of important features that assist with classification. For example, features like is_abnormal_url, count_http, and count_// have a significant influence across all classes. Meanwhile, certain features, such as is_abnormal_url, play a more prominent role specifically in identifying "Benign" and "Defacement" cases, while count_subdomain is more important for identifying "Phishing" cases.

This analysis provides deeper insights into how individual features drive model decisions and helps validate our feature selection process. cess.

¶


Deep Learning Models¶

In [58]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().set_output(transform="pandas")

ds = X_train.describe().T

binary_cols = ds[(ds['max'] == 1) & (ds['min'] == 0) & (ds['50%'] == 0)].index

X_train_binary = X_train[binary_cols]
X_test_binary = X_test[binary_cols]
X_train_numerical = X_train.drop(binary_cols, axis=1)
X_test_numerical = X_test.drop(binary_cols, axis=1)

X_train_numerical_scaled = scaler.fit_transform(X_train_numerical)
X_test_numerical_scaled  = scaler.transform(X_test_numerical)


X_train_scaled = pd.merge(X_train_numerical_scaled, X_train_binary, left_index=True, right_index=True)
X_test_scaled = pd.merge(X_test_numerical_scaled, X_test_binary, left_index=True, right_index=True)
In [59]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader,TensorDataset, WeightedRandomSampler
In [60]:
def create_fnn_model(input_size: int, output_size: int, layer_units: List[int], activation_functions: List[str],
                     lr: float, gamma: float, step_size: int, L2lambda: float, dropout_rate: float, doBN: bool) -> Tuple[nn.Module, nn.Module, optim.Optimizer, optim.lr_scheduler.StepLR]:
    class FnnModel(nn.Module):
        def __init__(self, input_size, output_size, layer_units, activation_functions, dropout_rate, doBN):
            super().__init__()

            if len(activation_functions) != len(layer_units):
                raise ValueError(f"The number of activation functions must match the number of layers: "
                                 f"len of activation_functions: {len(activation_functions)}, "
                                 f"len of layer_units: {len(layer_units)}")

            self.layers = nn.ModuleDict()
            self.input_size = input_size
            self.output_size = output_size
            self.nLayers = len(layer_units)
            self.layer_units = layer_units
            self.dropout_rate = dropout_rate
            self.doBN = doBN
            self.activation_functions = activation_functions

            self.layers['input'] = nn.Linear(self.input_size, self.layer_units[0])
            if self.doBN:
                self.layers['bn_input'] = nn.BatchNorm1d(self.layer_units[0])

            for layer in range(1, self.nLayers):
                self.layers[f'hidden_{layer}'] = nn.Linear(self.layer_units[layer - 1], self.layer_units[layer])
                if self.doBN:
                    self.layers[f'bn_{layer}'] = nn.BatchNorm1d(self.layer_units[layer])
                if dropout_rate > 0:
                    self.layers[f'dropout_{layer}'] = nn.Dropout(dropout_rate)

            self.layers['output'] = nn.Linear(self.layer_units[-1], self.output_size)

        def forward(self, x):
            actfun = getattr(F, self.activation_functions[0], None)
            if actfun is None:
                raise ValueError(f"Activation function '{self.activation_functions[0]}' is not defined.")
            
            x = self.layers['input'](x)

            if self.doBN:
                x = self.layers['bn_input'](x)

            x = actfun(x)

            for fc in range(1, self.nLayers):
                x = self.layers[f'hidden_{fc}'](x)
                if self.doBN:
                    x = self.layers[f'bn_{fc}'](x)

                actfun = getattr(F, self.activation_functions[fc], None)
                if actfun is None:
                    raise ValueError(f"Activation function '{self.activation_functions[fc]}' is not defined.")
                    
                x = actfun(x)
                if self.dropout_rate > 0:
                    x = self.layers[f'dropout_{fc}'](x)

            x = self.layers['output'](x)
            return x
    
    net = FnnModel(input_size, output_size, layer_units, activation_functions, dropout_rate, doBN)
    lossfun = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=lr, weight_decay=L2lambda)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)

    return net, lossfun, optimizer, scheduler

def function2trainTheModel(numepochs: int, train_loader: DataLoader, test_loader: DataLoader, net: nn.Module, 
                            lossfun: nn.Module, optimizer: optim.Optimizer, scheduler: optim.lr_scheduler._LRScheduler, 
                            computation_metric: Callable, verbos: bool = 1) -> Tuple[nn.Module, torch.Tensor, List[float], List[float]]:
    
    start_time = time.time()
    
    losses = torch.zeros(numepochs)
    trainAcc = []
    testAcc = []
    
    
    for epochi in range(numepochs):
        epoch_start_time = time.time()
        
        net.train()
        batchAcc = []
        batchLoss = []
        
        for X, y in train_loader:
            yHat = net(X)
            loss = lossfun(yHat, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            batchLoss.append(loss.item())
            batchAcc.append(computation_metric(yHat, y))
        
        scheduler.step()
        if epochi % scheduler.step_size == 0 and verbos:
            print(f"\n*** Epoch {epochi+1}, Step Size: {epochi}, Learning Rate: {scheduler.get_last_lr()[0]} ***\n")
        
        trainAcc.append(np.mean(batchAcc))
        losses[epochi] = np.mean(batchLoss)
        
        epoch_end_time = time.time()
        epoch_duration = epoch_end_time - epoch_start_time
        
        if epochi == 1:
            estimated_total_time = epoch_duration * numepochs
            print(f"\n*** Estimated total time for training: {estimated_total_time:.2f} seconds, {estimated_total_time/60:.2f} minutes. ***\n")

        if verbos:
            print(f'Epoch {epochi+1}/{numepochs}, Loss: {losses[epochi]:.4f}, elapsed time: {epoch_duration:.2f} sec')
              
        net.eval()
        X, y = next(iter(test_loader))
        with torch.no_grad():
            yHat = net(X)
        testAcc.append(computation_metric(yHat, y))
    
    total_time = time.time() - start_time
    print(f"Total time elapsed: {total_time:.2f} sec, {total_time/60:.2f} minutes")
    
    return net, losses, trainAcc, testAcc, yHat


def plot_training_metrics(losses, trainAcc, testAcc, metricName)-> None:
    fig, ax = plt.subplots(2, 1, figsize=(12,8))

    ax[0].plot(losses, label='Losses')
    ax[0].set_title("Losses")
    ax[0].set_xlabel("Number of epochs")
    ax[0].set_ylabel("Loss")
    ax[0].legend([f'Loss: {losses[-1]:.3f}'])

    ax[1].plot(trainAcc, label='Train')
    ax[1].plot(testAcc, label='Test')
    ax[1].set_title(f"{metricName}")
    ax[1].set_xlabel("Number of epochs")
    ax[1].set_ylabel(f"{metricName}")
    ax[1].legend([f'Train: {trainAcc[-1]:.3f}', f'Test: {testAcc[-1]:.3f}'])

    plt.tight_layout()
    plt.show()

Initial Network CONFIG¶

In [61]:
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

numepochs = 30
step_size  = 10
gamma = 0.8
dropout_rate = 0.2
layer_units = [ 128, 256, 512, 256, 128 ]
learningRate = 0.01
L2lambda     = 0.0
compute_accuracy_multi = lambda yHat, y: (torch.argmax(torch.softmax(yHat, dim=1), dim=1) == y).float().sum().item() / len(y) * 100

Balancing Data with WeightedRandomSampler in PyTorch¶

To handle class imbalance, we can use WeightedRandomSampler in PyTorch. This sampler assigns weights to each sample, ensuring that minority classes are represented more frequently during training.

  1. Calculate Sample Weights: Compute weights for each sample using compute_sample_weight from sklearn.
  2. Create a WeightedRandomSampler: Pass the sample weights to WeightedRandomSampler.
  3. Use in DataLoader: Add the sampler to DataLoaderto balance class representation.
In [62]:
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
sample_weights_tensor = torch.tensor(sample_weights, dtype=torch.float32)
sampler = WeightedRandomSampler(weights=sample_weights_tensor, num_samples=len(sample_weights_tensor), replacement=True)

Manuel Hyperparameters Search (Base on Experiment)¶


WorkFlow¶
  1. Network Structure: Batch Normalization (True / False) AND Activation Functions ('relu6' or 'tanh')
  2. Network Learning: Batch Size (1024 / 2048 / 4096) AND Learning Rate (0.001 / 0.01) AND Gamma (1.0 / 0.7)
  3. Network Regularization: Dropout Rate (0.0 / 0.2 / 0.4)
  4. Network Architecture: Breadth net vs. Depth net
Network Structure¶
In [661]:
%%time

X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)

X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)


batch_size = 2048
train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset, batch_size=test_dataDataset.tensors[0].shape[0], shuffle=False)

input_size = train_loader.dataset.tensors[0].shape[1]
output_size = len(y_train.unique())

relu6 = ['relu6', 'relu6', 'relu6', 'relu6', 'relu6']
tanh = ['tanh', 'tanh', 'tanh', 'tanh', 'tanh']

results_dict = {
    "relu6_BatchNorm_True": {"Test Accuracy": None, "Time": None},
    "relu6_BatchNorm_False": {"Test Accuracy": None, "Time": None},
    "tanh_BatchNorm_True": {"Test Accuracy": None, "Time": None},
    "tanh_BatchNorm_False": {"Test Accuracy": None, "Time": None},
}

for functions, activation_functions in zip(["relu6", "tanh"], [relu6, tanh]):
    for norm in [True, False]:
        config = f"{functions}_BatchNorm_{norm}"
        print(f"Activation Functions: {functions} | BatchNorm: {norm}")
        
        start_time = time.time()

        net, lossfun, optimizer, scheduler = create_fnn_model(
            input_size, output_size, layer_units, activation_functions,
            learningRate, gamma, step_size, L2lambda, dropout_rate, norm
        )

        net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
            numepochs, train_loader, test_loader, net, lossfun, optimizer, scheduler,
            compute_accuracy_multi, verbos=0
        )

        results_dict[config]["Test Accuracy"] = testAcc[-1]
        results_dict[config]["Time"] = time.time() - start_time
        
        print(f"Completed for Activation Functions: {functions} | BatchNorm: {norm}\n")

configs = list(results_dict.keys())
test_accuracies = [metrics["Test Accuracy"] for metrics in results_dict.values()]
times = [metrics["Time"] for metrics in results_dict.values()]

fig, ax = plt.subplots(2, 1, figsize=(14, 7))

sns.barplot(x=configs, y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy vs Activation Functions and BatchNorm')
ax[0].set_xlabel('Configurations (Activation Function + BatchNorm)')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)

for p in ax[0].patches:
    ax[0].annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', xytext=(0, 9), textcoords='offset points')

sns.barplot(x=configs, y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Time Taken vs Activation Functions and BatchNorm')
ax[1].set_xlabel('Configurations (Activation Function + BatchNorm)')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')

for p in ax[1].patches:
    ax[1].annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', xytext=(0, 9), textcoords='offset points')

plt.tight_layout()
plt.show()
Activation Functions: relu6 | BatchNorm: True

*** Estimated total time for training: 678.90 seconds, 11.32 minutes. ***

Total time elapsed: 774.52 sec, 12.91 minutes
Completed for Activation Functions: relu6 | BatchNorm: True

Activation Functions: relu6 | BatchNorm: False

*** Estimated total time for training: 595.95 seconds, 9.93 minutes. ***

Total time elapsed: 737.40 sec, 12.29 minutes
Completed for Activation Functions: relu6 | BatchNorm: False

Activation Functions: tanh | BatchNorm: True

*** Estimated total time for training: 667.93 seconds, 11.13 minutes. ***

Total time elapsed: 767.24 sec, 12.79 minutes
Completed for Activation Functions: tanh | BatchNorm: True

Activation Functions: tanh | BatchNorm: False

*** Estimated total time for training: 586.91 seconds, 9.78 minutes. ***

Total time elapsed: 704.93 sec, 11.75 minutes
Completed for Activation Functions: tanh | BatchNorm: False

No description has been provided for this image
CPU times: total: 3h 57min 28s
Wall time: 49min 45s

Experiment Summary of Results from Activation Functions and Batch Normalization¶

Summary of Batch Size, Learning Rate and Gamma The results show the following key findings:

  • Batch Normalization (doBN=True) consistently outperformed the models without batch normalization:

    • relu6_BatchNorm_True: 94.24.%** accuracy in 744 seconds.
    • tanh_BatchNorm_True: 93.95% accuracy in 767 seconds.
  • Without Batch Normalization (doBN=False):

    • relu6_BatchNorm_False: 93.66% accuracy in 737 seconds.
    • tanh_BatchNorm_False: 92.34% accuracy in 704.49 seconds.
  • Conclusion:

    • Batch normalization shows a slight improvement over not using batch normalization.
    • The difference between ReLU6 and Tanh is relatively small, with no significant advantage of one over the other.
Network Learning¶
In [75]:
%%time

doBN = True
activation_functions = ['relu6']*len(layer_units)

batch_sizes = [1024, 2048, 4096]
learning_rates = [0.001, 0.01]
gammas = [1.0, .7] 
step_size = numepochs // 2

results = {}

X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)  
X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)

input_size = X_train_scaled_tensor.shape[1]
output_size = len(y_train.unique())

for batch_size in batch_sizes:
    for lr in learning_rates:
        for gamma in gammas:
            print(f"\nTesting Batch Size: {batch_size}, Learning Rate: {lr}, Gamma: {gamma}")
            
            train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
            test_loader = DataLoader(test_dataDataset, batch_size=len(test_dataDataset), shuffle=False)
            
            start_time = time.time()
            
            net, lossfun, optimizer, scheduler = create_fnn_model(
                input_size=input_size, output_size=output_size, 
                layer_units=layer_units, activation_functions=activation_functions, 
                lr=lr, gamma=gamma, step_size=step_size, 
                L2lambda=0, dropout_rate=0, doBN=doBN
            )

            net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
                numepochs=numepochs, train_loader=train_loader, 
                test_loader=test_loader, net=net, lossfun=lossfun, 
                optimizer=optimizer, scheduler=scheduler, 
                computation_metric=compute_accuracy_multi, verbos=0
            )

            elapsed_time = time.time() - start_time
            
            results[(batch_size, lr, gamma)] = {
                "Test Accuracy": testAcc[-1],
                "Time": elapsed_time
            }


for params, metrics in results.items():
    print(f"Batch Size: {params[0]}, LR: {params[1]}, Gamma: {params[2]} --> Accuracy: {metrics['Test Accuracy']:.2f}, Time: {metrics['Time']:.2f} seconds")

configs = [f'BS: {params[0]}, LR: {params[1]}, γ: {params[2]}' for params in results.keys()]
test_accuracies = [metrics["Test Accuracy"] for metrics in results.values()]
times = [metrics["Time"] for metrics in results.values()]

fig, ax = plt.subplots(2, 1, figsize=(14, 7))

sns.barplot(x=configs, y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy vs Batch Size, Learning Rate, and Gamma')
ax[0].set_xlabel('Configurations (Batch Size, Learning Rate, Gamma)')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)

for p in ax[0].patches:
    ax[0].annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', xytext=(0, 9), textcoords='offset points')

sns.barplot(x=configs, y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Time Taken vs Batch Size, Learning Rate, and Gamma')
ax[1].set_xlabel('Configurations (Batch Size, Learning Rate, Gamma)')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')

for p in ax[1].patches:
    ax[1].annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', xytext=(0, 9), textcoords='offset points')

plt.tight_layout()
plt.show()
Testing Batch Size: 1024, Learning Rate: 0.001, Gamma: 1.0

*** Estimated total time for training: 1075.80 seconds, 17.93 minutes. ***

Total time elapsed: 1250.89 sec, 20.85 minutes

Testing Batch Size: 1024, Learning Rate: 0.001, Gamma: 0.7

*** Estimated total time for training: 1072.31 seconds, 17.87 minutes. ***

Total time elapsed: 1243.56 sec, 20.73 minutes

Testing Batch Size: 1024, Learning Rate: 0.01, Gamma: 1.0

*** Estimated total time for training: 1058.10 seconds, 17.64 minutes. ***

Total time elapsed: 1252.66 sec, 20.88 minutes

Testing Batch Size: 1024, Learning Rate: 0.01, Gamma: 0.7

*** Estimated total time for training: 1081.19 seconds, 18.02 minutes. ***

Total time elapsed: 1263.18 sec, 21.05 minutes

Testing Batch Size: 2048, Learning Rate: 0.001, Gamma: 1.0

*** Estimated total time for training: 1027.10 seconds, 17.12 minutes. ***

Total time elapsed: 1178.46 sec, 19.64 minutes

Testing Batch Size: 2048, Learning Rate: 0.001, Gamma: 0.7

*** Estimated total time for training: 995.41 seconds, 16.59 minutes. ***

Total time elapsed: 1175.72 sec, 19.60 minutes

Testing Batch Size: 2048, Learning Rate: 0.01, Gamma: 1.0

*** Estimated total time for training: 979.24 seconds, 16.32 minutes. ***

Total time elapsed: 1182.46 sec, 19.71 minutes

Testing Batch Size: 2048, Learning Rate: 0.01, Gamma: 0.7

*** Estimated total time for training: 1007.06 seconds, 16.78 minutes. ***

Total time elapsed: 1186.15 sec, 19.77 minutes

Testing Batch Size: 4096, Learning Rate: 0.001, Gamma: 1.0

*** Estimated total time for training: 974.17 seconds, 16.24 minutes. ***

Total time elapsed: 1167.93 sec, 19.47 minutes

Testing Batch Size: 4096, Learning Rate: 0.001, Gamma: 0.7

*** Estimated total time for training: 1038.73 seconds, 17.31 minutes. ***

Total time elapsed: 1174.83 sec, 19.58 minutes

Testing Batch Size: 4096, Learning Rate: 0.01, Gamma: 1.0

*** Estimated total time for training: 1031.37 seconds, 17.19 minutes. ***

Total time elapsed: 1181.84 sec, 19.70 minutes

Testing Batch Size: 4096, Learning Rate: 0.01, Gamma: 0.7

*** Estimated total time for training: 1048.11 seconds, 17.47 minutes. ***

Total time elapsed: 1180.69 sec, 19.68 minutes
Batch Size: 1024, LR: 0.001, Gamma: 1.0 --> Accuracy: 95.19, Time: 1250.91 seconds
Batch Size: 1024, LR: 0.001, Gamma: 0.7 --> Accuracy: 95.19, Time: 1243.56 seconds
Batch Size: 1024, LR: 0.01, Gamma: 1.0 --> Accuracy: 94.93, Time: 1252.67 seconds
Batch Size: 1024, LR: 0.01, Gamma: 0.7 --> Accuracy: 95.36, Time: 1263.18 seconds
Batch Size: 2048, LR: 0.001, Gamma: 1.0 --> Accuracy: 94.98, Time: 1178.48 seconds
Batch Size: 2048, LR: 0.001, Gamma: 0.7 --> Accuracy: 94.90, Time: 1175.75 seconds
Batch Size: 2048, LR: 0.01, Gamma: 1.0 --> Accuracy: 94.66, Time: 1182.47 seconds
Batch Size: 2048, LR: 0.01, Gamma: 0.7 --> Accuracy: 95.34, Time: 1186.15 seconds
Batch Size: 4096, LR: 0.001, Gamma: 1.0 --> Accuracy: 95.30, Time: 1167.93 seconds
Batch Size: 4096, LR: 0.001, Gamma: 0.7 --> Accuracy: 95.34, Time: 1174.83 seconds
Batch Size: 4096, LR: 0.01, Gamma: 1.0 --> Accuracy: 94.64, Time: 1181.84 seconds
Batch Size: 4096, LR: 0.01, Gamma: 0.7 --> Accuracy: 94.69, Time: 1180.71 seconds
No description has been provided for this image
CPU times: total: 19h 42min 25s
Wall time: 4h 1min 12s

Summary of Batch Size, Learning Rate, and Gamma Experiments¶

This experiment tested different Batch Sizes (1024, 2048, 4096), Learning Rates (0.001, 0.01), and Gamma values (1.0, 0.7) to assess their impact on model performance.

Key Findings:¶
  • Batch Size 1024 consistently achieved the highest accuracy, with up to 95.36% at LR 0.01, Gamma 0.7.
  • Gamma 1.0 generally performed better, especially at LR 0.001.
  • Training Time: Smaller batch sizes (1024) took longer but yielded higher accuracy, whereas larger batch sizes (2048, 4096) were faster but slightly less accurate.
Conclusion:¶

The optimal setup was Batch Size 1024, Learning Rate 0.001, and Gamma 1.0 for the best balance of accuracy (95.19%) and efficiency.

Network Regularization¶
In [44]:
%%time

doBN = True
activation_functions = ['relu6']*len(layer_units)
batch_size = 1024
lr = 0.001
gamma = 1.0
dropout_rates = [0.0, 0.2, 0.4]  

results = {}

X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)

train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset, batch_size=len(test_dataDataset), shuffle=False)

input_size = X_train_scaled_tensor.shape[1]
output_size = len(y_train.unique())

for dropout_rate in dropout_rates:
    print(f"\nTesting Dropout Rate: {dropout_rate}")
              
    start_time = time.time()
    
    net, lossfun, optimizer, scheduler = create_fnn_model(
        input_size=input_size, output_size=output_size, 
        layer_units=layer_units, activation_functions=activation_functions,
        lr=lr, gamma=gamma, step_size=numepochs, L2lambda = 0.0,
        dropout_rate=dropout_rate, doBN=doBN  
    )

    net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
        numepochs=numepochs, train_loader=train_loader, 
        test_loader=test_loader, net=net, lossfun=lossfun, 
        optimizer=optimizer, scheduler=scheduler, 
        computation_metric=compute_accuracy_multi, verbos=0
    )

    elapsed_time = time.time() - start_time
    
    results[dropout_rate] = {
        "Test Accuracy": testAcc[-1],
        "Time": elapsed_time
    }

for dropout_rate, metrics in results.items():
    print(f"Dropout Rate: {dropout_rate} --> Accuracy: {metrics['Test Accuracy']:.2f}, Time: {metrics['Time']:.2f} seconds")

dropout_rates = list(results.keys())
test_accuracies = [metrics["Test Accuracy"] for metrics in results.values()]
times = [metrics["Time"] for metrics in results.values()]

fig, ax = plt.subplots(2, 1, figsize=(14, 7))

sns.barplot(x=dropout_rates, y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy vs Dropout Rate')
ax[0].set_xlabel('Dropout Rate')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)

for p in ax[0].patches:
    ax[0].annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', xytext=(0, 9), textcoords='offset points')

sns.barplot(x=dropout_rates, y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Time Taken vs Dropout Rate')
ax[1].set_xlabel('Dropout Rate')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')

for p in ax[1].patches:
    ax[1].annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', xytext=(0, 9), textcoords='offset points')

plt.tight_layout()
plt.show()
Testing Dropout Rate: 0.0

*** Estimated total time for training: 496.70 seconds, 8.28 minutes. ***

Total time elapsed: 498.06 sec, 8.30 minutes

Testing Dropout Rate: 0.2

*** Estimated total time for training: 421.51 seconds, 7.03 minutes. ***

Total time elapsed: 504.22 sec, 8.40 minutes

Testing Dropout Rate: 0.4

*** Estimated total time for training: 431.92 seconds, 7.20 minutes. ***

Total time elapsed: 498.00 sec, 8.30 minutes
Dropout Rate: 0.0 --> Accuracy: 94.88, Time: 498.07 seconds
Dropout Rate: 0.2 --> Accuracy: 93.98, Time: 504.22 seconds
Dropout Rate: 0.4 --> Accuracy: 93.76, Time: 498.00 seconds
No description has been provided for this image
CPU times: total: 2h 6min 29s
Wall time: 25min

Summary of Dropout Rate Experiment¶

Experiments were conducted with Dropout Rates of 0.0, 0.2, and 0.4 to observe the effect on accuracy and training time.

Key Findings:¶
  • Dropout Rate 0.0 yielded the best accuracy at 94.88% and the fastest training time of 498.07 seconds.
  • Higher Dropout Rates (0.2 and 0.4) slightly reduced accuracy, with 93.98% and 93.76%, respectively, and did not significantly impact training time.
Conclusion:¶

A Dropout Rate of 0.0 is optimal for achieving the highest accuracy with efficient training time. ease in training time.

Network Architecture¶
In [45]:
batch_size = 1024
lr = 0.001
gamma = 1.0
doBN = True
dropout_rate = 0.0
L2lambda = 0.0

layer_units_wide = [1024, 1024]  # Breadth Net: Wider layers
layer_units_deep = [128, 256, 512, 256, 128]  # Depth Net: More layers

results = {}

train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader  = DataLoader(test_dataDataset,batch_size=test_dataDataset.tensors[0].shape[0])

for network_type, layer_units in [("Deep", layer_units_deep), ("Wide", layer_units_wide)]:
    print(f"\nTesting {network_type} Network")
    

    start_time = time.time()
    
    net, lossfun, optimizer, scheduler = create_fnn_model(
        input_size=input_size, output_size=output_size, 
        layer_units=layer_units, activation_functions=['relu6']*len(layer_units), 
        lr=lr, gamma=gamma, step_size=numepochs, 
        L2lambda=L2lambda, dropout_rate=dropout_rate, doBN=True  
    )

    net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
        numepochs=numepochs, train_loader=train_loader, 
        test_loader=test_loader, net=net, lossfun=lossfun, 
        optimizer=optimizer, scheduler=scheduler, 
        computation_metric=compute_accuracy_multi, verbos=0
    )

    elapsed_time = time.time() - start_time
    
    results[network_type] = {
        "Test Accuracy": testAcc[-1],
        "Time": elapsed_time
    }

for net_type, metrics in results.items():
    print(f"{net_type} Network --> Accuracy: {metrics['Test Accuracy']:.2f}, Time: {metrics['Time']:.2f} seconds")

network_types = results.keys()
test_accuracies = [metrics["Test Accuracy"] for metrics in results.values()]
times = [metrics["Time"] for metrics in results.values()]

fig, ax = plt.subplots(2, 1, figsize=(10, 6))

sns.barplot(x=list(network_types), y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy: Deep vs. Wide Network')
ax[0].set_xlabel('Network Type')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)

sns.barplot(x=list(network_types), y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Training Time: Deep vs. Wide Network')
ax[1].set_xlabel('Network Type')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')

plt.tight_layout()
plt.show()
Testing Deep Network

*** Estimated total time for training: 373.12 seconds, 6.22 minutes. ***

Total time elapsed: 453.39 sec, 7.56 minutes

Testing Wide Network

*** Estimated total time for training: 738.78 seconds, 12.31 minutes. ***

Total time elapsed: 806.30 sec, 13.44 minutes
Deep Network --> Accuracy: 95.19, Time: 453.39 seconds
Wide Network --> Accuracy: 94.69, Time: 806.30 seconds
No description has been provided for this image

Experiment Summary of Deep vs. Wide Network Architectures¶

The experiments compared Deep and Wide network architectures to evaluate their impact on model accuracy and training time.

Key Findings:¶
  • Deep Network achieved the highest accuracy at 95.19% with a faster training time of 453.39 seconds.
  • Wide Network showed slightly lower accuracy at 94.69% and took significantly longer to train at 806.30 seconds.
Conclusion:¶

The Deep Network is the preferred architecture, providing both higher accuracy and faster training.ompared to a wider network.

¶

Hyperparameter Search Summary¶

Throughout the experiments, various combinations of hyperparameters were tested to optimize model accuracy and efficiency. Across different configurations, the results consistently achieved high accuracy (above 93%), showing the robustness of the model under a variety of settings. Although there were no extreme differences between configurations, the exploration process helped identify the best-performing combination.

This hyperparameter search demonstrated the importance of adjusting parameters like batch size, learning rate, gamma, batch normalization, dropout rate, and network architecture to achieve optimal performance. While there are infinite possible configurations, the final results led to a preferred combination.

Additional Notes¶

  • The experiment was conducted on a CPU, and using a GPU could significantly speed up the process.
  • The FNN model performed well across almost all tested configurations.

Conclusion¶

The best configuration was achieved with:

  • Batch Size: 1024
  • Learning Rate (lr): 0.001
  • Gamma: 1.0
  • Batch Normalization (doBN): True
  • Dropout Rate: 0.0
  • Network Architecture: Deep network

This combination provides an effective balance of accuracy and efficiency for the given task.

Updated Network CONFIG¶

In [64]:
numepochs = 50
doBN = True
batch_size = 1024
learningRate = 0.001
gamma = 1.0
step_size  = numepochs
dropout_rate = 0.0
L2lambda     = 0.0
layer_units = [ 128, 256, 512, 256, 128 ]
activation_functions = ['relu'] * len(layer_units)


X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)  

X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset  = TensorDataset(X_test_scaled_tensor, y_test_tensor)

sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
sample_weights_tensor = torch.tensor(sample_weights, dtype=torch.float32)
sampler = WeightedRandomSampler(weights=sample_weights_tensor, num_samples=len(sample_weights_tensor), replacement=True)

train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader  = DataLoader(test_dataDataset,batch_size=test_dataDataset.tensors[0].shape[0])
   
input_size = train_loader.dataset.tensors[0].shape[1]
output_size  = len(y_train.unique()) # 4

Next Steps - Model Training with Best Parameters¶

In this section, we will run the model using the optimal hyperparameters identified earlier. We will conduct two experiments:

  1. Model Training with All Features
  2. Model Training with BorutaPy Selected Features

All Features¶

In [47]:
net, lossfun, optimizer, scheduler  = create_fnn_model(input_size, output_size, layer_units, activation_functions , learningRate, gamma, step_size, L2lambda, dropout_rate, doBN)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(numepochs, train_loader, test_loader, net, lossfun, optimizer, scheduler, compute_accuracy_multi)
plot_training_metrics(losses, trainAcc, testAcc, 'Accuracy For All Features')

predicted_classes = torch.argmax(yHat, dim=1).cpu().numpy()
check_accuracy(y_test, predicted_classes, 'All Features')
*** Epoch 1, Step Size: 0, Learning Rate: 0.001 ***

Epoch 1/50, Loss: 0.2520, elapsed time: 17.55 sec

*** Estimated total time for training: 730.51 seconds, 12.18 minutes. ***

Epoch 2/50, Loss: 0.1843, elapsed time: 14.61 sec
Epoch 3/50, Loss: 0.1646, elapsed time: 14.05 sec
Epoch 4/50, Loss: 0.1502, elapsed time: 14.01 sec
Epoch 5/50, Loss: 0.1417, elapsed time: 14.14 sec
Epoch 6/50, Loss: 0.1343, elapsed time: 13.65 sec
Epoch 7/50, Loss: 0.1288, elapsed time: 16.11 sec
Epoch 8/50, Loss: 0.1233, elapsed time: 13.84 sec
Epoch 9/50, Loss: 0.1193, elapsed time: 13.94 sec
Epoch 10/50, Loss: 0.1171, elapsed time: 13.81 sec
Epoch 11/50, Loss: 0.1115, elapsed time: 13.63 sec
Epoch 12/50, Loss: 0.1104, elapsed time: 13.73 sec
Epoch 13/50, Loss: 0.1065, elapsed time: 13.72 sec
Epoch 14/50, Loss: 0.1051, elapsed time: 13.78 sec
Epoch 15/50, Loss: 0.1027, elapsed time: 12.96 sec
Epoch 16/50, Loss: 0.0996, elapsed time: 12.88 sec
Epoch 17/50, Loss: 0.0977, elapsed time: 13.87 sec
Epoch 18/50, Loss: 0.0957, elapsed time: 13.61 sec
Epoch 19/50, Loss: 0.0949, elapsed time: 13.45 sec
Epoch 20/50, Loss: 0.0933, elapsed time: 12.83 sec
Epoch 21/50, Loss: 0.0910, elapsed time: 15.65 sec
Epoch 22/50, Loss: 0.0899, elapsed time: 15.83 sec
Epoch 23/50, Loss: 0.0890, elapsed time: 15.25 sec
Epoch 24/50, Loss: 0.0877, elapsed time: 15.23 sec
Epoch 25/50, Loss: 0.0863, elapsed time: 15.76 sec
Epoch 26/50, Loss: 0.0846, elapsed time: 17.00 sec
Epoch 27/50, Loss: 0.0846, elapsed time: 14.98 sec
Epoch 28/50, Loss: 0.0823, elapsed time: 14.20 sec
Epoch 29/50, Loss: 0.0820, elapsed time: 14.89 sec
Epoch 30/50, Loss: 0.0806, elapsed time: 20.45 sec
Epoch 31/50, Loss: 0.0795, elapsed time: 20.40 sec
Epoch 32/50, Loss: 0.0790, elapsed time: 15.66 sec
Epoch 33/50, Loss: 0.0776, elapsed time: 15.83 sec
Epoch 34/50, Loss: 0.0763, elapsed time: 15.83 sec
Epoch 35/50, Loss: 0.0769, elapsed time: 15.20 sec
Epoch 36/50, Loss: 0.0755, elapsed time: 20.76 sec
Epoch 37/50, Loss: 0.0746, elapsed time: 18.93 sec
Epoch 38/50, Loss: 0.0735, elapsed time: 18.07 sec
Epoch 39/50, Loss: 0.0736, elapsed time: 16.63 sec
Epoch 40/50, Loss: 0.0730, elapsed time: 14.97 sec
Epoch 41/50, Loss: 0.0723, elapsed time: 18.76 sec
Epoch 42/50, Loss: 0.0709, elapsed time: 13.29 sec
Epoch 43/50, Loss: 0.0710, elapsed time: 14.25 sec
Epoch 44/50, Loss: 0.0704, elapsed time: 17.97 sec
Epoch 45/50, Loss: 0.0704, elapsed time: 17.74 sec
Epoch 46/50, Loss: 0.0689, elapsed time: 20.39 sec
Epoch 47/50, Loss: 0.0680, elapsed time: 16.52 sec
Epoch 48/50, Loss: 0.0679, elapsed time: 14.49 sec
Epoch 49/50, Loss: 0.0680, elapsed time: 13.22 sec
Epoch 50/50, Loss: 0.0671, elapsed time: 12.91 sec
Total time elapsed: 890.19 sec, 14.84 minutes
No description has been provided for this image
No description has been provided for this image

BorutaPy Selected Features¶

In [48]:
boruta_net, boruta_lossfun, optimizer, scheduler  = create_fnn_model(input_size, output_size, layer_units, activation_functions , learningRate, gamma, step_size, L2lambda, dropout_rate, doBN)
boruta_net, boruta_losses, boruta_trainAcc,boruta_testAcc, boruta_yHat = function2trainTheModel(numepochs, train_loader, test_loader, boruta_net, lossfun, optimizer, scheduler, compute_accuracy_multi)
plot_training_metrics(boruta_losses, boruta_trainAcc, boruta_testAcc, 'Accuracy For BorutaPy Features')


predicted_classes = torch.argmax(boruta_yHat, dim=1).cpu().numpy()
check_accuracy(y_test, predicted_classes, 'BorutaPy Selected Features')
*** Epoch 1, Step Size: 0, Learning Rate: 0.001 ***

Epoch 1/50, Loss: 0.2465, elapsed time: 12.86 sec

*** Estimated total time for training: 635.34 seconds, 10.59 minutes. ***

Epoch 2/50, Loss: 0.1811, elapsed time: 12.71 sec
Epoch 3/50, Loss: 0.1628, elapsed time: 14.05 sec
Epoch 4/50, Loss: 0.1511, elapsed time: 14.53 sec
Epoch 5/50, Loss: 0.1417, elapsed time: 14.21 sec
Epoch 6/50, Loss: 0.1342, elapsed time: 12.84 sec
Epoch 7/50, Loss: 0.1291, elapsed time: 12.86 sec
Epoch 8/50, Loss: 0.1244, elapsed time: 13.60 sec
Epoch 9/50, Loss: 0.1205, elapsed time: 12.81 sec
Epoch 10/50, Loss: 0.1156, elapsed time: 12.88 sec
Epoch 11/50, Loss: 0.1125, elapsed time: 13.27 sec
Epoch 12/50, Loss: 0.1103, elapsed time: 14.05 sec
Epoch 13/50, Loss: 0.1080, elapsed time: 14.72 sec
Epoch 14/50, Loss: 0.1038, elapsed time: 13.60 sec
Epoch 15/50, Loss: 0.1036, elapsed time: 15.36 sec
Epoch 16/50, Loss: 0.0995, elapsed time: 13.60 sec
Epoch 17/50, Loss: 0.0984, elapsed time: 15.18 sec
Epoch 18/50, Loss: 0.0962, elapsed time: 14.50 sec
Epoch 19/50, Loss: 0.0953, elapsed time: 12.91 sec
Epoch 20/50, Loss: 0.0938, elapsed time: 12.94 sec
Epoch 21/50, Loss: 0.0915, elapsed time: 12.59 sec
Epoch 22/50, Loss: 0.0896, elapsed time: 12.55 sec
Epoch 23/50, Loss: 0.0896, elapsed time: 12.64 sec
Epoch 24/50, Loss: 0.0871, elapsed time: 13.19 sec
Epoch 25/50, Loss: 0.0865, elapsed time: 12.67 sec
Epoch 26/50, Loss: 0.0855, elapsed time: 13.77 sec
Epoch 27/50, Loss: 0.0829, elapsed time: 12.55 sec
Epoch 28/50, Loss: 0.0827, elapsed time: 12.92 sec
Epoch 29/50, Loss: 0.0831, elapsed time: 12.88 sec
Epoch 30/50, Loss: 0.0799, elapsed time: 12.66 sec
Epoch 31/50, Loss: 0.0803, elapsed time: 12.64 sec
Epoch 32/50, Loss: 0.0791, elapsed time: 12.52 sec
Epoch 33/50, Loss: 0.0780, elapsed time: 13.06 sec
Epoch 34/50, Loss: 0.0775, elapsed time: 13.67 sec
Epoch 35/50, Loss: 0.0764, elapsed time: 16.23 sec
Epoch 36/50, Loss: 0.0754, elapsed time: 13.13 sec
Epoch 37/50, Loss: 0.0739, elapsed time: 13.08 sec
Epoch 38/50, Loss: 0.0738, elapsed time: 12.60 sec
Epoch 39/50, Loss: 0.0741, elapsed time: 12.70 sec
Epoch 40/50, Loss: 0.0725, elapsed time: 12.54 sec
Epoch 41/50, Loss: 0.0721, elapsed time: 12.92 sec
Epoch 42/50, Loss: 0.0701, elapsed time: 12.96 sec
Epoch 43/50, Loss: 0.0707, elapsed time: 14.37 sec
Epoch 44/50, Loss: 0.0696, elapsed time: 12.52 sec
Epoch 45/50, Loss: 0.0700, elapsed time: 12.82 sec
Epoch 46/50, Loss: 0.0685, elapsed time: 13.09 sec
Epoch 47/50, Loss: 0.0689, elapsed time: 12.53 sec
Epoch 48/50, Loss: 0.0694, elapsed time: 12.75 sec
Epoch 49/50, Loss: 0.0675, elapsed time: 12.58 sec
Epoch 50/50, Loss: 0.0671, elapsed time: 13.10 sec
Total time elapsed: 775.02 sec, 12.92 minutes
No description has been provided for this image
No description has been provided for this image

Summary of Network Results¶

In this experiment, two different network configurations were trained and tested:

  1. All Features (slightly better performance)
  2. BorutaPy Selected Features
Results Overview:¶
  • Good Accuracy: Both models demonstrated strong overall accuracy in detecting malicious URLs and performed competitively in a relatively short training time.

  • Benign Class:

    • The models achieved around 95% accuracy, indicating robust performance in detecting benign URLs with a low misclassification rate of 4.29%.
  • Defacement Class:

    • The models reached around 99% accuracy, reflecting their high effectiveness in identifying defacement.
  • Malware Class:

    • Both models showed around 93% accuracy, demonstrating solid performance in detecting malware.
  • Phishing Class:

    • Accuracy for the phishing class was slightly lower, with results around 91%, indicating this class remains a challenging area that may benefit from further optimization and feature engineering.
Conclusion¶

The All Features model performed slightly better than the BorutaPy Selected Features model, particularly in capturing subtle differences between classes. Both models achieved strong results across most classes with efficient training times. However, consistent with previous findings, the phishing class remains a weak point, which requires further refinement for improved accuracy. Misclassification rates were low for benign URLs, and misclassifications between malware, defacement, and phishing were relatively minor.


¶

¶


FNN with BERT¶

What is BERT?¶

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model designed to understand context by looking at text in both directions (left-to-right and right-to-left). It is widely used for tasks like text classification, question answering, and feature extraction due to its ability to capture deep, contextual meaning from words and phrases.

What is bert-base-uncased?¶

bert-base-uncased is a version of BERT with 12 layers and 110 million parameters, trained on lowercased English text. It treats text as case-insensitive, meaning all inputs are converted to lowercase, making it efficient for handling tasks where capitalization isn’t important (like URLs). Practically, BERT extracts 768-dimensional feature vectors for each input, allowing it to represent the nuanced semantics of the text.

Why Use bert-base-uncased for URL Detection?¶
  1. Contextual Understanding: URLs often contain subtle patterns that can indicate malicious behavior. BERT’s ability to capture token relationships allows it to detect these nuances effectively.

  2. No Need for Feature Engineering: Instead of manually designing features, BERT automatically extracts rich, meaningful representations from URLs, which can then be classified as phishing, malware, or other categories.

  3. Scalable: Pre-trained models like bert-base-uncased can be fine-tuned for specific tasks (like URL detection) with less data, making it efficient and scalable for large datasets.

  4. Bidirectional Analysis: BERT processes text in both directions, helping capture important patterns across the entire URL, whether they occur at the start or end of the string.

In [104]:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
In [106]:
%%time
import torch

def extract_bert_features_batch(urls):
    inputs = tokenizer(urls, return_tensors='pt', padding=True, truncation=True, max_length=133)
    
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']  
    
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)        
        token_embeddings = outputs.last_hidden_state  
        sentence_embeddings = torch.mean(token_embeddings, dim=1)
    
    return sentence_embeddings

batch_size = 32
bert_features = []

for i in range(0, len(df), batch_size):    
    batch_urls = df.reset_index().iloc[i:i + batch_size]['url'].tolist()  
    batch_features = extract_bert_features_batch(batch_urls)  
    bert_features.append(batch_features)

bert_features = torch.cat(bert_features, dim=0).numpy()

print("BERT Features Shape:", bert_features.shape)
print("Target Shape:", df['type'].shape)
BERT Features Shape: (608510, 768)
Target Shape: (608510,)
CPU times: total: 2d 15h 19min 41s
Wall time: 11h 53min
In [107]:
from sklearn.model_selection import train_test_split

bert_features_df = pd.DataFrame(bert_features)
labels_df = pd.DataFrame(df['type'].values, columns=['type'])

bert_df = pd.concat([bert_features_df, labels_df], axis=1)

X = bert_df.drop(columns=['type'])  
y = bert_df['type']

X_train_bert, X_test_bert, y_train_bert, y_test_bert = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

print(X_train_bert.shape, X_test_bert.shape, y_train_bert.shape, y_test_bert.shape)
(486808, 768) (121702, 768) (486808,) (121702,)
In [108]:
bert_ds = X_train_bert.describe().T
bert_ds
Out[108]:
count mean std min 25% 50% 75% max
0 486808.0 0.186599 0.142357 -0.756574 0.096685 0.187011 0.275938 0.900724
1 486808.0 -0.079289 0.137129 -0.801882 -0.171787 -0.079434 0.012914 0.775830
2 486808.0 0.317695 0.138945 -0.387983 0.225834 0.309290 0.403508 1.086027
3 486808.0 0.041942 0.163707 -0.714411 -0.057832 0.060781 0.155435 0.736621
4 486808.0 0.292893 0.123600 -0.782637 0.212815 0.291612 0.372815 0.985944
... ... ... ... ... ... ... ... ...
763 486808.0 -0.152188 0.161910 -0.962243 -0.263471 -0.173447 -0.053416 0.589963
764 486808.0 -0.039436 0.123828 -0.688595 -0.119005 -0.037260 0.045138 0.534348
765 486808.0 -0.075762 0.121400 -0.894989 -0.153326 -0.070693 0.003862 0.498655
766 486808.0 -0.068985 0.124215 -0.673868 -0.150809 -0.066869 0.012941 0.798272
767 486808.0 0.137579 0.138483 -0.535462 0.052473 0.138289 0.222824 0.972031

768 rows × 8 columns

In [109]:
y_train_bert = y_train_bert.apply(lambda x: encoding_map[x])
y_test_bert  = y_test_bert.apply(lambda x: encoding_map[x])
In [110]:
X_train_bert_tensor = torch.tensor(X_train_bert.values, dtype=torch.float32)
y_train_bert_tensor = torch.tensor(y_train_bert.values, dtype=torch.long)  

X_test_bert_tensor = torch.tensor(X_test_bert.values, dtype=torch.float32)
y_test_bert_tensor = torch.tensor(y_test_bert.values, dtype=torch.long)

train_dataDataset = TensorDataset(X_train_bert_tensor, y_train_bert_tensor)
test_dataDataset  = TensorDataset(X_test_bert_tensor, y_test_bert_tensor)

train_loader = DataLoader(train_dataDataset,batch_size=batch_size, shuffle=True, drop_last=True)
test_loader  = DataLoader(test_dataDataset,batch_size=test_dataDataset.tensors[0].shape[0])

input_size = train_loader.dataset.tensors[0].shape[1]
output_size  = len(y_train.unique()) # 4
In [111]:
step_size = 30
gamma = 0.8
dropout_rate = 0.2
net, lossfun, optimizer, scheduler  = create_fnn_model(input_size, output_size, layer_units, activation_functions , learningRate, gamma, step_size, L2lambda, dropout_rate, doBN)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(numepochs, train_loader, test_loader, net, lossfun, optimizer, scheduler, compute_accuracy_multi)
plot_training_metrics(losses, trainAcc, testAcc, 'Accuracy For All Features')


predicted_classes = torch.argmax(yHat, dim=1).cpu().numpy()
check_accuracy(y_test_bert, predicted_classes, 'All Features')
*** Epoch 1, Step Size: 0, Learning Rate: 0.001 ***

Epoch 1/50, Loss: 0.1427, elapsed time: 188.30 sec

*** Estimated total time for training: 9468.65 seconds, 157.81 minutes. ***

Epoch 2/50, Loss: 0.0958, elapsed time: 189.37 sec
Epoch 3/50, Loss: 0.0825, elapsed time: 187.48 sec
Epoch 4/50, Loss: 0.0749, elapsed time: 186.15 sec
Epoch 5/50, Loss: 0.0695, elapsed time: 186.40 sec
Epoch 6/50, Loss: 0.0654, elapsed time: 185.68 sec
Epoch 7/50, Loss: 0.0621, elapsed time: 184.97 sec
Epoch 8/50, Loss: 0.0592, elapsed time: 187.07 sec
Epoch 9/50, Loss: 0.0571, elapsed time: 189.39 sec
Epoch 10/50, Loss: 0.0546, elapsed time: 187.32 sec
Epoch 11/50, Loss: 0.0534, elapsed time: 188.09 sec
Epoch 12/50, Loss: 0.0511, elapsed time: 189.27 sec
Epoch 13/50, Loss: 0.0493, elapsed time: 187.42 sec
Epoch 14/50, Loss: 0.0481, elapsed time: 187.98 sec
Epoch 15/50, Loss: 0.0470, elapsed time: 185.96 sec
Epoch 16/50, Loss: 0.0452, elapsed time: 187.40 sec
Epoch 17/50, Loss: 0.0442, elapsed time: 184.66 sec
Epoch 18/50, Loss: 0.0435, elapsed time: 186.69 sec
Epoch 19/50, Loss: 0.0427, elapsed time: 187.27 sec
Epoch 20/50, Loss: 0.0416, elapsed time: 187.89 sec
Epoch 21/50, Loss: 0.0404, elapsed time: 188.09 sec
Epoch 22/50, Loss: 0.0398, elapsed time: 186.92 sec
Epoch 23/50, Loss: 0.0390, elapsed time: 192.95 sec
Epoch 24/50, Loss: 0.0379, elapsed time: 191.48 sec
Epoch 25/50, Loss: 0.0373, elapsed time: 187.20 sec
Epoch 26/50, Loss: 0.0365, elapsed time: 187.88 sec
Epoch 27/50, Loss: 0.0362, elapsed time: 335.88 sec
Epoch 28/50, Loss: 0.0353, elapsed time: 261.89 sec
Epoch 29/50, Loss: 0.0347, elapsed time: 292.15 sec
Epoch 30/50, Loss: 0.0342, elapsed time: 272.56 sec

*** Epoch 31, Step Size: 30, Learning Rate: 0.0008 ***

Epoch 31/50, Loss: 0.0317, elapsed time: 184.13 sec
Epoch 32/50, Loss: 0.0301, elapsed time: 186.29 sec
Epoch 33/50, Loss: 0.0300, elapsed time: 187.48 sec
Epoch 34/50, Loss: 0.0295, elapsed time: 188.71 sec
Epoch 35/50, Loss: 0.0290, elapsed time: 185.94 sec
Epoch 36/50, Loss: 0.0285, elapsed time: 204.32 sec
Epoch 37/50, Loss: 0.0279, elapsed time: 212.09 sec
Epoch 38/50, Loss: 0.0278, elapsed time: 220.62 sec
Epoch 39/50, Loss: 0.0272, elapsed time: 210.81 sec
Epoch 40/50, Loss: 0.0264, elapsed time: 220.84 sec
Epoch 41/50, Loss: 0.0264, elapsed time: 257.80 sec
Epoch 42/50, Loss: 0.0266, elapsed time: 219.34 sec
Epoch 43/50, Loss: 0.0260, elapsed time: 186.69 sec
Epoch 44/50, Loss: 0.0256, elapsed time: 199.13 sec
Epoch 45/50, Loss: 0.0253, elapsed time: 184.95 sec
Epoch 46/50, Loss: 0.0249, elapsed time: 211.25 sec
Epoch 47/50, Loss: 0.0248, elapsed time: 186.31 sec
Epoch 48/50, Loss: 0.0245, elapsed time: 185.64 sec
Epoch 49/50, Loss: 0.0243, elapsed time: 188.40 sec
Epoch 50/50, Loss: 0.0239, elapsed time: 185.36 sec
Total time elapsed: 10362.19 sec, 172.70 minutes
No description has been provided for this image
No description has been provided for this image

¶

¶

Project Summary¶


Overview¶

This project aimed to develop a highly accurate model for detecting malicious URLs using both machine learning (ML) and deep learning techniques. The workflow included data preprocessing, feature extraction, model training, and evaluation, culminating in excellent results.

Key Achievements¶

  1. Feature Extraction:

    • Key features such as URL length, special character counts, suspicious keywords, and n-gram patterns significantly contributed to classification accuracy.
  2. Model Performance:

    • Traditional models (XGBoost, LightGBM, CatBoost) performed well across feature sets, while deep learning models, especially the BERT-based FNN, achieved the highest overall accuracy of 98%. The model excelled at identifying Benign URLs with 99% accuracy, which was crucial to the primary goal of distinguishing between benign and malicious URLs effectively.
  3. Result:

    • The BERT-based FNN model demonstrated the best results, achieving 99% accuracy for detecting malicious URLs.

Future Directions¶

  • System Integration: Incorporate the model into a security framework, cross-referencing new URLs against a database of known safe URLs.
  • Continuous Monitoring: Utilize external tools like "VirusTotal" to enhance URL safety verification.
  • Ongoing Data Collection: Regularly collect new data to adapt to emerging threats, with scheduled re-training every 6–12 months to maintain high accuracy as hacking methods evolve.

Conclusion¶

Deep learning, particularly the BERT-based FNN, is highly effective for malicious URL detection, providing a robust "second firewall" in cybersecurity defenses that operates independently of specific systems. The model’s strength in accurately classifying benign URLs demonstrates its value in differentiating malicious from safe URLs, fulfilling the project’s core objective.